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Abstract 

This  report  presents  the  theoretical  development  and 
numerical  implementation  of  a  procedure  for  approximating 
continuous  probability  density  functions  on  a  bounded  inter¬ 
val.  The  work  is  applicable  to  Bayesian  decision  models  in 
that  available  information  is  used  to  update  or  obtain  the 
prior  distribution.  The  procedure  is  based  on  the  solution 
of  a  constrained  entropy  maximization  problem  and  requires 
information  in  the  form  of  expected  values  of  "information 
functions."  The  approach  involves  three  steps:  estimation 
of  expected  (or  average)  values  of  "potential"  information 
functions,  selection  of  the  "active"  subset  of  functions  to 
define  the  approximation  family,  and  simultaneous  solution 
of  the  constraints  10  select  the  specific  approximating  den¬ 
sity  for  a  given  set  of  data. 

A  useful  set  of  potential  information  functions  is 
developed,  and  three  numerical  methods  for  active  set  selec¬ 
tion  are  demonstrated.  Numerical  techniques  for  expected 
value  computation  are  discussed,  and  a  scheme  for  solution 
of  the  constraints  is  developed  and  implemented.  Theoreti¬ 
cal  development  includes  theorems  on  form  and  uniqueness. 
Approximation  accuracy  is  related  to  potential  set  defini¬ 
tion  and  data  accuracy.  The  procedure  is  applied  to 
several  known  distributions  to  demonstrate  applicability. 
Applications  to  computer  simulation  and  interval  arithmetic 

models  are  demonstrated  with  specific  examples. 

xii 


CONTINUOUS  DENSITY  APPROXIMATION  ON  A  BOUNDED 


INTERVAL  USING  INFORMATION  THEORETIC  CONCEPTS 
Chapter  I_.  Introduction 

This  dissertation  concerns  the  representation  or 
approximation  of  unknown  probability  distributions.  The 
proposed  approximation  method  is  based  on  the  concept  of 
maximum  entropy  and  uses  known  or  calculable  information 
about  the  unknown  distributions.  The  work  was  motivated  by 
Bayesian  decision  models,  as  discussed  by  Tribus  (Ref  82) , 
in  that  available  information  is  used  to  update  a  prior 
estimate  of  the  unknown  distribution.  The  prior  estimate 
is  assumed  to  be  tne  uniform  distribution  or  is  represented 
by  a  random  sample  from  the  unknown  distribution.  The 
initial  estimate  is  updated  via  information  in  the  form  of 
expected  values  of  certain  "information  functions."  Selec¬ 
tion  of  the  information  functions  determines  the  form  and 
accuracy  of  the  approximating  distribution. 

The  word  "characterize”  carries  special  meaning,  for 
purposes  of  this  dissertation,  in  describing  the  accuracy  of 
approximation.  Our  use  of  the  word  is  here  defined  to  pre¬ 
clude  later  misinterpretation ,  and  because  our  use  is  dif¬ 
ferent  from  the  usual  statistical  meaning.  Assume  that  the 
unknown  distribution  is  generated  by  the  analytic  density 
function  f  (x) ,  and  let  p(x)  symbolize  the  approximating 
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density.  When  p(x)  is  reducible  to  the  exact  form  of  f (x) 
to  include  the  correct  parameter  values,  then  we  say  that 
p (x )  characterizes  f<x).  If  p(x)  is  of  a  different  form 
than  f  (x) ,  chen  p(x)  approximates  f(x).  Thus  "characterize" 
is  used  to  indicate  an  exact  representation  of  the  unknown 
analytic  density.  Given  this  definition,  we  wish  to  char¬ 
acterize  or  accurately  approximate  the  unknown  distribution. 

An  initial  concern  of  the  research  was  to  provide 
a  method  to  represent  the  output  distribution  of  a  computer 
simulation  in  the  interest  of  error  propagation  studies. 
Although  the  resulting  method  has  direct  benefit  to  simula¬ 
tion,  the  method  can  be  applied  to  more  general  characteri¬ 
ze  -xon  or  approximation  problems.  The  following  chapters 
present  the  proposed  method  in  detail,  discuss  computer 
implementation  of  the  method  to  include  efficient  numerical 
techniques,  and  investigate  potential  applications. 

Chapter  II  provides  a  background  summary  of  the 
information  theoretic  concepts  which  form  the  foundation  of 
the  proposed  method.  Concepts  such  as  information  variation 
and  maximum  entropy  are  discussed  as  well  as  recent  applica¬ 
tions  of  these  concepts.  Chapters  III  through  VIII  discuss 
the  proposed  character! zation  method  and  numerical  tech¬ 
niques  for  implementation.  The  method  is  applied  to  com¬ 
puter  simulation  in  Chapter  IX  followed  by  discussion  of 
method  sensitivity  .in  Chapter  X.  Additional  applications 
are  presented  in  Chapter  xl  .  The  paper  concludes  with  a 

.  .  m  <•  :  !  :.  •  a  :  v  •  uro  research  in  Chapter  XII. 
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Chapter  II ■  Background 


Distribution  Approximation 
with  Maximum  Uncertainty 

The  problem  of  interest  concerns  the  characteriza¬ 
tion  of  an  unknown  distribution  based  on  information  that 
is  provided,  or  information  that  one  may  obtain,  concerning 
the  distribution.  We  assume  that  the  unknown  distribution 
of  random  variable  X  is  generated  by  an  unknown  probability 
density  function,  f  (x)  ,  where  X  may  be  a  vector  and  thus 
f (x)  may  be  a  multivariate  distribution.  We  concentrate 
on  providing  an  algebraic  characterization  or  approximation, 
p (x ) ,  for  the  unknown  density.  We  define  "information"  as 
anything  that  is  known  or  assumed  about  the  random  variable 
or  the  distribution  of  the  random  variable.  Clearly,  the 
amount  and  nature  of  available  or  assumed  information  will 
greatly  influence  the  resulting  approximation,  p (x) .  For 
example  one  may  assume  (or  know)  that  the  unknown  density 
is  normal,  A’(u,o2),  and  obtain  further  information  in  terms 
of  a  random  sample,  x^,  i=l,2,...N.  From  the  sample,  one 
calculates 


N  N 

x  =  E  x  . /N  and  s2  =  E  (x  . -x)  2  /  (N-l ) 
i=l  1  i=l  1 


to  approximate  the  mean  and  variance  and,  thus,  determines 
the  appropriate  estimation  of  the  unknown  distribution. 


i .e . , 

p(x)  =  { 2ttS 2 )  15  exp[-(x-x)  2/(2s2 )  ] 

If  the  form  of  the  distribution  is  known,  as  in  our  example, 
then  the  problem  reduces  to  one  of  parameter  estimation. 
However,  in  many  engineering  or  statistical  problems,  suffi¬ 
cient  information  to  determine  the  form  of  an  unknown  dis¬ 
tribution  is  not  available. 

Assuming  a  particular  density  form  without  evidence 
to  support  such  a  form  will  unnecessarily  bias  the  distribu¬ 
tion  approximation  and  produce  potential  bias  or  error  in 
the  ultimate  solution  of  the  problem.  Consequently,  the 
method  of  distribution  characterize Lion  that  is  proposed  in 
this  paper  derives  a  density  form  that  utilizes  only  the 
available  information  while  maintaining  "maximum  uncer¬ 
tainty"  with  respect  to  other,  unspecified  information.  As 
discussed  in  later  sections,  the  method  assumes  that  the 
specified  information  can  be  provided  in  terms  of  average 
values  of  certain  functions  of  the  random  variable  X.  The 
method  is  based  on  a  specific  measure  of  uncertainty  for  a 
distribution,  the  distribution  entropy. 

Pntjer y  and  1  n formation  Variation 

The  entropy  function,  S,  was  defined  by  Claude 
Phannon  (Refs  71;  72;  82)  ir  1948  as  a  measure  of  the  uncer¬ 
tainty  of  a  aiven  answer  to  a  well  defined  question. 

w.-'k  centered  on  communications  theory,  but 


( 


4 


provided  a  basis  for  E.  T.  Jaynes'  extension  of  entropy  in 
statistical  mechanics  (Refs  42;  43;  44).  Myron  Tribus  (Ref 
82)  consolidates  the  work  of  Shannon  and  Jaynes  and  provides 
a  thorough  discussion  of  the  concept  of  maximum  entropy. 
Tribus  (Ref  82:111-117)  presents  a  derivation  of  the  entropy 
measure  and  several  examples  of  entropy  as  a  measure  of 
uncertainty.  We  consider  a  simplified  example  for  illus¬ 
tration  and  definition.  For  a  complete  discussion  of  the 
development  of  the  entropy  measure,  see  E.  T.  Jaynes'  1979 
article  (Ref  45:15-118). 

Consider  a  set  of  N  possible  events  with  the  proba¬ 
bility  of  occurrence  of  each  event  known.  The  probabili¬ 
ties  p^,  i=l,2,...N  are  known,  but  no  further  information 
is  available  concerning  which  event  will  occur.  Then 

S (pf p2 . . -PN)  ,  defined  by  Shannon  as  S  (p1 ,p2 . . .pN)  = 

N 

-Ip.  In  p  ,  is  a  measure  of  how  much  "choice"  is  involved 
i=l  1  1 

in  the  selection  of  a  single  event  or  how  "uncertain"  one 

is  of  the  outcome  of  event  selection.  As  an  indication  of 

this  uncertainty  measure,  consider  N  equally  likely  events, 

that  is  p^=p^=. . .=pn=1/N.  One's  uncertainty  as  to  which 

event  will  occur  increases  as  N  increases,  i.e.,  as  the 

number  of  possible  events  increases.  In  a  similar  manner, 

the  value  of  the  uncertainty  measure  or  system  entropy, 

N 

S  ( p,  ,  p_ ,  . .  .  tx, )  =  -  I  p-  In  p.  =  In  N,  increases  as  N 
*  £  i=l  1  1 

increases.  Consequently,  to  paraphrase  Shannon/Jaynes/ 


Tribus,  if  one  wishes  to  construct  a  minimally  prejudiced 
probability  distribution  (a  distribution  which  maximizes 
uncertainty)  based  on  information  about  that  distribution, 
one  must  maximize  the  entropy  subject  to  constraints  which 
are  specified  by  the  given  information. 

Before  proceeding  with  the  maximum  entropy  charac¬ 
terization  method,  we  must  extend  the  entropy  measure  to 
include  continuous  probability  density  functions.  Several 
of  the  references  discuss  the  continuous  rase  (Refs  14;  17; 

32;  89;  90),  but  Silviu  Guiasu  (Ref  33)  provides  the  most 
satisfactory  treatment.  Shannon's  work  tells  us  that  the 
entropy 

N 

S(Pl,p2...pN)=  pi  In  p.  (2.1)  £ 

provides  a  measure  of  uncertainty  for  the  finite,  N  dimen¬ 
sional  probability  space  where  p^  =  probability  of  the  ith 

N 

event;  p.  0.  i=l,2,...N;  and  Ip.  =1.  Guiasu  considers 
1  ~  i=l  1 

entropy,  in  a  comparable  fashion,  as  the  "amount  of  infor¬ 
mation"  conveyed  by  the  given  distribution.  We  now  consider 
a  continuous  probability  density,  p(x),  on  a  bounded 
interval  ta,b)  such  that  p(x)_>0  and  p(x)  dx  =  1.  One  ■ 

might  erroneously  assume  that  equation  (2.1)  is  logically 
extended,  in  the  limit,  to  the  integrable  case  in  the  form 

S(p(x))  =  -  /^p(x)  lnp(x)  dx  (2.2) 

u 
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Equation  (2.2),  in  fact,  represents  the  Boltzman  H-function 
from  classical  thermodynamics  which  was  defined  as  early  as 
1896.  The  H-function  measures  the  disorder  of  a  physical 
system  (Ref  33:14)  and  inspired  Shannon  to  study,  by  analogy, 
the  discrete  entropy  S (p1 ,p2 . . .p  ) .  However,  equation  (2.2) 
is  not  the  limiting  case  of  equation  (2.1).  Consider  the 
uniform  probability  density  on  [a,b]; 


p  (x)  = 


1 / (b-a) 
0 


a<x<b 

otherwise 


(2.3) 


Then  S(p(x))  =  -  /b  l/(b-a)  In  [l/(b-a)J  dx  =  In  (b-a)  . 

Cl 

However,  S (p^ , P2 . . .pN)  =  In  N  as  previously  stated  for 
Pj=P2~ • • • Pn  =  i-e.,  the  discrete  uniform  equivalent. 

Clearly,  limit  S (p1,p2 . . .pN)  4  S(p(x)).  Thus  the  question 
remains  of  how  to  relate  "continuous  entropy"  to  a  measure 
of  uncertainty  while  forcing  consistency  between  the  dis¬ 
crete  and  continuous  cases.  S.  Kullback  and  R.  A.  Leibler 


(Ref  50)  provided  insight  with  a  measure  of  information 
variation . 


The  Kullback-Leibler  information  discrimination 


measure  provides  a  means  of  comparing  or  measuring  the 
information  that  is  lost  or  gained  when  one  probability 
measure  replaces  a  second  probability  measure.  Kullback 
(Ref  51)  and  Guiasu  (Ref  33)  both  offer  excellent  develop¬ 
ment  of  the  information  discrimination  measure  which  is 


based  on  the  well  known  Radon-Nikodym  theorem  (Ref  35) . 

The  development  will  not  be  repeated  here,  but  we  consider 
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only  definition  and  relationship  to  entropy.  Consider 
sample  space  X  and  the  sigma  algebra,  L,  of  measurable  sets 
of  elements  of  X .  We  define  probability  measures  U^,  U2 
on  L  to  denote  probability  spaces  (X ,  L,  lh  )  ,  i=l,2.  Proba¬ 
bility  measures  and  U2  are  assumed  to  be  absolutely  con¬ 
tinuous  with  respect  to  each  other;  that  is,  for  every  set 
E  in  L  if  U1 (E) =0  then  U2(E)=0  or  if  U2(E)=0  then  U1(E)=0. 
Tnen  the  "variation  of  information"  when  we  pass  from  ini¬ 
tial  probability  measure  to  the  new  probability  measure 
U2 ,  absolutely  continuous  with  respect  to  U^,  is  the  inte¬ 
gral 

I(U2,U1)  =  fy  4 (x)  In  4  (x)  dU1(x) 

=  Jy  In  C (x)  dU2  (2.4) 

where  4 (x)  =  dU2 (x ) /dU^ (x )  is  the  Radon-Nikodym  derivative. 
If  we  now  associate  cumulative  distribution  functions 
?1 (x)  and  P2 (x)  with  measures  U^(x)  and  U2(x)  (Ref  66:261) 
where  P1(x)  and  P2(x)  represent  respective  density  func¬ 
tions,  we  may  reduce  (2.4)  to  a  measure  of  information  vari¬ 
ation  between  two  continuous  probability  density  functions: 

I(p2(x),p1(x))  =  f  p2  (x)  In  [P2  (x)  /p1  (x)  ]  dx  (2.5) 

The  variation  of  information  function,  I(U2,U^),  appears 
frequently  in  the  literature  (Refs  18;  26;  29;  73;  77;  79). 
Both  Guiasu  and  Kullback  offer  thorough  presentations  of  the 
properties  o 2  I(U2,U1). 
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Following  the  thrust  of  Guiasu's  development,  we 
may  now  relate  entropy  to  a  variation  of  information.  We 
use  equation  (2.5),  let  X=[a,b],  and  let  the  initial  dis¬ 
tribution,  p^(x),  be  the  uniform  distribution  on  [a,b]. 

The  uniform  density  is  given  in  equation  (2.3).  Then  equa¬ 
tion  (2.5)  reduces  to 

I(P2<x)  /P1(x))  =  p2(x)  1  In  p2  (x)  +  ln  (b-a)  ]  dx 

=  in  (b-a)  / ^  p2  (x)  dx  +  / ^  p2  (x)  In  p2  (x)  dx 

I  (p2 (x) ,p^ (x) )  =  ln(b-a)  -  S(p2(x)) 

Therefore,  as  Guiasu  states,  the  continuous  entropy  S(p(x)) 
may  be  interpreted  (up  to  an  additive  constant)  as  the  vari¬ 
ation  of  information  in  passing  from  the  uniform  probabil¬ 
ity  distribution  on  [a,b]  to  the  new  probability  measure 
defined  by  p(x)  on  [a,b].  A  similar  development  follows 

for  Shannon  entropy  in  the  discrete  case.  Given  p^_>0 , 

N 

i=l,2,...N  and  E  p.=l,  then  S  (p. ,p, . . .p„ )  =  In  N  -  I  (p. , 
i=l  -  1  *  *  1 

P2 . . .pN , , q2 . . . )  where  q^=l/N,  i=l,2,...N.  Thus  both 
Boltzman's  continuous  entropy  and  Shannon's  discrete 
entropy  serve  as  a  measure  of  the  variation  of  information 
when  we  pass  from  the  initial  uniform  distribution  to  the 
corresponding  probability  density  of  interest.  With  this 
confirmation,  we  proceed  to  investigate  the  maximum 
entropy  concept. 
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Maximum 


Formalism 


Tribus  (Ref  82)  formalizes  the  maximum  entropy  con¬ 
cept  for  practical  application.  The  goal  is  to  approximate 
the  unknown  distribution  of  random  variable  X  with  a  "mini¬ 
mally  prejudiced"  probability  density  function,  p(x),  based 
on  known  or  calculable  information  about  the  unknown  dis¬ 
tribution.  The  basic  underlying  principle,  as  originally 
put  forward  by  E.  T.  Jaynes,  is  here  repeated;  "The  mini¬ 
mally  prejudiced  probability  distribution  is  that  which 
maximizes  the  entropy  subject  to  constraints  supplied  by 
+  he  given  information"  (Ref  82:120).  An  adaptation  of  the 
Jaynes/Tribus  formalism  is  presented  in  the  following  steps: 

1.  Define  the  density  structure,  i.e.,  discrete  or 
continuous.  If  a  discrete  density  is  involved  then  this 
step  includes  definition  of  possible  outcomes;  the  entropy 
formalism  assumes  that  the  possible  outcomes  are  known  and 
that  we  desire  an  approximation  to  the  probability  of  each 
outcome.  For  a  continuous  density,  we  require  definition 
of  the  set  X  of  possible  outcomes,  i.e.,  the  interval  of 
integration  [a,b]  in  equation  (2.2).  Notice  that  [a,b] 
may  be  infinite.  In  practical  application,  the  interval 
may  be  determined  (or  approximated)  via  random  sample  from 
the  unknown  distribution  where  x^,  i=l,2,...N,  is  the  random 
sample  and  (a,b]  =  [mm  x^,  max  x^J.  Notice  also  that  the 
random  variable  X  mav  be  vector  valued  or  multivariate. 


10 


2.  Constrain  the  density  approximation,  p(x),  to 
satisfy  the  given  information.  We  here  consider  continuous 
densities  although  a  parallel  development  holds  for  the  dis¬ 
crete  case.  The  given  information  is  assumed  to  consist 

of  expected  values  (or  average  value  approximations)  of 
functions  g^(x),  j=l,2,...K,  of  the  random  variable  X.  In 
the  work  that  follows,  we  call  these  functions  "information 
functions"  to  indicate  their  significance  in  providing 
information  about  the  unknown  distribution.  The  formalism 
assumes  that  the  information  functions,  g^ (x) ,  and  the 
expected  values,  <gj(x)>,  are  known  or  specified.  Thus, 
constraints  on  p(x)  take  the  following  form: 

<g^(x)>  =  /bg.(x)p(x)  dx,  j=l ,  2  ,  .  .  .  K 

The  selection  of  specific  g^ (x)  and  calculation  of  <g^(x)>, 
j=l , 2 , . . . K ,  in  fact  determine  the  form  of  the  resulting 
approximation,  p(x).  Consequently,  much  of  our  effort  per¬ 
tains  to  an  intelligent  selection  of  information  functions. 
We  notice  that  the  formalism  (or  an  adaptation  of  the  for¬ 
malism)  may  still  be  applied  if  the  available  information 
takes  a  form  other  than  expected  (average)  values  of  sre- 
cified  functions.  One  such  example  is  discussed  in  the 
applications  chapter,  Chapter  XI. 

3.  Finally,  maximize  the  entropy  subject  to  the 
given  constraints. 
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Application  of  the  above  formalism,  for  the  bounde 
integrable  case,  produces  a  constrained  maximization  prob¬ 
lem: 

max  S(p(x))  =  max  {-  /bp(x)  lnp(x)  dx) 

subject  to;  /bp(x)  dx  =  1,  (2.6 

3 

/ab  (x)  p(x)  dx  =  <q.(x)>,  j  =  1 , 2  ,  .  .  .  K 

with  p(x)  unknown,  the  value  of  <g^(x)>  and  the  form  of 
g^(xi,  j=l,2,...K,  given.  Tribus  solves  this  problem  in 
the  discrete  case  using  the  Lagrange  met  hoc  of  undeterir.inc- 
ccef f icients.  The  Lagrange  method  also  applies  tc  the 
integrable  case  (Refs  27;  56).  The  analytical  form,  cf  the 
solution  is 

p(x)  =  exp[->0-l1g1  (x) .  '.KgK(x)3  (2.7 

where  V,  j-l,2,...K,  are  the  Lagrange  multipliers.  Egua- 

h ion  (2.7)  represents  the  form  of  the  mir.xmal]y  prejudiced 

maximum  entropy  distribution.  We  show  in  Chapter  IV, 

Theorem  4.],  that  equation  (2.7)  is  the  -ui  -  form  for 

the  entropy  density.  In  a  similar  sense,  equation  (2.7) 

r;'"’:on tr  a  family  cf  d.stributions  where  the  specific 

l  1  v  u~  of  Ir- tore  .t  Is  selected  through  appropriate 

.'•-••1  c-rfior  of  the  Lagrange  multiplier  vector  -  [1^,1^, 

T 

1  .  Th .  Lagrange  rultj.pl.iers  ait-  unknown  at  this 


function,  a  .  (x)  . 


is  predefined  by  the  analyst.  The  g^ (x) ,  j=l,2,...K,  may 
take  any  form  such  that  the  expected  values  are  known  or 
calculable . 

Entropy  Applications  in 
the  Literature 

Several  authors  have  discussed  application  of  the 
maximum  entropy  formalism  as  indicated  in  the  list  of  refer¬ 
ences.  Applications  to  spectral  and  time  series  analysis, 
economic  problems,  decision  theory  and  pattern  recognition 
problems,  and  physics  and  thermodynamics  problems  are 
examples  of  the  available  literature  (Refs  7;  9;  12;  58?  74; 
84).  D.  V.  Gokhala  (Ref  31)  provides  excellent  supportive 
discussion  for  entropy  characterization  based  on  known 
expected  values  of  certain  functions  of  a  random  variable. 
However,  most  of  the  available  literature  concentrates  on 
application  in  the  discrete  density  case.  The  discrete 
entropy  maximization  problem  (the  continuous  case  is  repre¬ 
sented  in  equations  (2.6))  represents  a  set  of  simultaneous 
linear  equations.  Solution  of  the  k+1  constraint  equations 
for  the  appropriate  A  is  thus  somewhat  simpler  in  the  dis¬ 
crete  case  as  compared  to  the  nonlinear  continuous  problem. 
Agmon  (Ref  3)  presents  an  algorithm  for  computer  solution  of 
the  discrete  problem.  Gokhale  (Ref  30)  presents  a  second 
approach  to  the  discrete  case  and  the  list  of  references 
provides  several  applications  to  specific  discrete  problems 
(Refs  22;  52;  83;  86;  87)  . 
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Application  of  the  entropy  method  to  continuous 
density  approximation  results  in  a  system  of  nonlinear  con 
straint  equations,  equations  (2.6),  that  must  be  solved 
simultaneously  to  find  A.  Although  solution  of  the  con¬ 
straint  equations  is,  in  general,  quite  difficult,  a  few 
specific  continuous  distributions  are  well  known.  For 
example,  if  all  that  is  known  about  the  distribution  of 
random  variable  X  on  (-oc,°°)  is  the  values  of  <x>  and  <x2>, 
then  the  resulting  maximum  entropy  distribution  is  the  nor 
n.al  distribution.  To  see  this,  consider  the  known  form  of 
the  e  lit  ropy  density: 

p(x)  =  exp  (-  '  -  A  x- 1  i 

U  1  4. 

The  normal  density  function  f(x)  with  mean  i  and  variance 
o"’  follows: 

f(x)  ---  ( 2 ~ c  ‘  )  ~ *■  e p  i  -  ( x - p )  2  /  (2c2)  ] 

=  (2-c2)'*5  exp  [ - (2a J ) -1xr  +  (u /c2 ) x  -  (y 2 /2c 2 )  ] 

*  exp! <C-(y2 /2o? ) )  +  (u/c:)x  -  (2c2) ~1x:] 

where  C  =  In  (1/  *2 tt c 2  >  •  Thus 

'  Q  =  ( u ‘  / 2 c r )  -C ,  X1  =  -  y/o2,  A 2  =  1  /2c 2 

wiil  produce  a  normal  distribution  with  mean  u  and  vari¬ 
ance  n* .  Theoretical  development  by  Guiasu  and  Tribus 
(Refs  3J:.',98;  82-131)  point  to  the  above  results;  experi- 
r  --n  t  -  f  i  c  o  ,  with  a  re- t  V  of  approximation  that  is  detailed 


14 


in  this  dissertation,  substantiates  the  expected  results. 
Other  examples  include  the  uniform  distribution  if  no  infor¬ 
mation  (except  [a,b])  is  known,  and  the  exponential  distribu¬ 
tion  if  only  <x>  (on  (0,°°))  is  known.  Much  of  the  litera¬ 
ture  on  continuous  entropy  centers  on  application  of  these 
known  entropy  forms.  For  example,  Dudewicz  and  van  der 
Meulen  (Ref  24)  utilize  known  entropy  forms  to  develop  the 
concept  of  "entropy-distinguishability"  and  entropy-based 
tests  of  hypothesis.  Other  examples  may  be  found  in  the 
list  of  references. 

Two  separate  approaches  to  continuous  density 
approximation  based  on  entropy  concepts  are  found  in  the 
literature.  The  primary  difference  between  the  approaches 
is  the  choice  of  what  we  have  called  "information  func¬ 
tions,"  i.e.,  g.(x) ,  j=l,2,...K.  B.  R.  Crain  (Refs  15;  16; 
17)  selects  the  Legendre  polynomials  as  information  func¬ 
tions  and  restricts  p(x)  to  be  an  element  of  L2[-l,l), 
i.e.,  square  integrable  functions  over  (-1,1].  The  Legendre 
polynomials  form  a  complete  orthonormal  basis  of  L2 [—1,1] 
which  leads  to  theoretically  sound  convergence  properties 
for  selected  approximation  densities.  Wilson  and  Wragg 
(Ref  89)  and  Wragg  and  Dowson  (Ref  90)  provide  good  theo¬ 
retical  development  for  a  similar  approach  on  (O,00)  where 
the  information  functions  are  taken  as  moments.  Practical 
application  of  either  method  is  restricted  by  the  need  to 
answer  the  following  questions; 
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1.  How  does  one  determine  the  number  of  moments 


or  Legendre  polynomials  that  are  needed  for  a  particular 
approximation  ? 

2.  How  does  one  find  the  necessary  Lagrange  multi¬ 
plier  vector  A? 

3.  How  useful  is  the  resulting  algebraic/analytic 
expression  for  p(x),  i.e.,  as  the  number  of  required  infor¬ 
mation  functions  increases,  does  p(x)  become  computationally 
and  conceptually  cumbersome? 

Collins  and  Wragg  (Ref  14)  took  a  logical  step 
toward  reducing  the  computational  difficulty  of  an  approxi¬ 
mation  based  on  moments.  They  reduced  the  continuous  prob¬ 
lem  to  a  discrete,  and  thus  linear,  problem  by  resorting  to 
frequency  histograms.  The  histogram  method  is  computa¬ 
tionally  appealing  but  does  not  provide  an  algebraic  repre¬ 
sentation  of  p(x) .  Young  and  Coraluppi  (Ref  91)  present 
an  interesting  approach  to  a  reduced  problem.  They  present 
an  algorithm  for  the  approximation  of  a  probability  density 
function  by  a  mixture  of  normal  density  functions  with 
unknown  means  and  variances.  Their  approach  is  also  based 
on  minimizing  an  information  criterion. 

The  following  chapters  present  a  practical  method 
io:  approximating  an  unknown  probability  density  function 
based  on  information  in  the  form  of  average  or  expected 
values  of  information  functions,  g^(x),  j=i,2,...K.  The 
he-iit  oi  tiie  method  is  intelligent  selection  of  the 
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information  functions  from  a  large  set  of  potential  infor¬ 
mation  functions.  The  method  builds  on  the  maximum  entropy 
formalism  as  outlined  in  this  chapter.  The  theoretical 
development  includes  theorems  on  the  form  and  uniqueness 
of  the  approximating  density  for  a  given  set  of  informa¬ 
tion  (Ref  Chapter  IV) .  The  resulting  method  is  computa¬ 
tionally  feasible  and  efficient,  and  numerical  techniques 
for  implementation  are  demonstrated. 
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Chapter  III .  Entropy  Approximation 


Introduction 

The  distribution  approximation  problem,  as  intro¬ 
duced  in  Chapter  II,  is  refined  in  the  first  section  of  this 
chapter.  The  second  section  presents  the  general  approxi¬ 
mation  procedure  which  results  from  application  of  maximum 
entropy  concepts.  Subsequent  chapters  explore  the  detailed 
steps  of  the  general  procedure  and  specific  applications. 

Problem  Refinement 

The  problem  of  interest  concerns  approximation  of 
the  unknown  distribution  of  random  variable  X  based  on 
information  that  is  provided,  or  information  that  one  may 
obtain,  concerning  the  unknown  distribution.  We  are  par¬ 
ticularly  interested  in  approximating  the  output  distribu¬ 
tions  of  computer  simulations.  The  goal  of  this  research  is 
to  produce  an  approximation  method  that  is  theoretically 
sound,  suitable  for  practical  application,  and  specifically 
adaptable  to  computer  implementation.  With  the  goal  in 
mind,  we  may  apply  the  entropy  formalism  of  Chapter  II. 

Our  previous  adaptation  of  the  entropy  formalism 
includes  three  steps:  define  the  density  structure,  con¬ 
strain  the  density  to  given  information,  and  select  the  spe¬ 
cific  density  that  maximizes  the  entropy.  First  we  define 
the  density  structure.  This  paper  will  restrict 
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investigation  to  continuous  densities  on  the  bounded  inter¬ 
val  [a,b].  The  investigation  centers  on  characterization 
of  univariate  distributions  where  we  assume  that  the  dis¬ 
tribution  is  generated  by  an  underlying,  unknown  density. 

We  seek  a  representation  or  approximation  for  the  unknown 
density.  While  we  concentrate  on  univariate  distributions, 
notice  that  nothing  in  our  conceptual  or  theoretical  devel¬ 
opment  precludes  extension  of  the  method  to  multivariate 
distributions,  i.e.,  where  random  variable  X  is  vector 
valued.  The  density  structure  defined,  we  now  proceed  with 
the  entropy  formalism. 

Approximation  Procedure 

As  previously  discussed,  the  unknown  density  func¬ 
tion  for  random  variable  X  will  be  approximated  by  a  maxi¬ 
mum  entropy  function  of  the  form 

p(x)  =  exp  [-\0--X1g1  (x) -.  .  ,XKgK(x>  ]  (3.1) 

where  the  g^(x),  i=l,2,...K  are  "information  functions," 
and  the  X^,  i=0,l,2,...K  are  Lagrange  multipliers.  The  key 
to  providing  an  accurate  representation  of  an  unknown  den¬ 
sity,  thus  the  key  to  our  approximation  procedure,  rests 
in  the  ability  to  select  the  proper  information  functions 
and  the  appropriate  Lagrange  multipliers.  The  approxima¬ 
tion  procedure  is  thus  composed  of  three  basic  steps: 
selecr  the  appropriate  information  functions:  calculate  the 


expected  or  average  values,  <g^{x)>,  i=l,2,...K;  and  solve 
the  entropy  maximization  problem  for  the  Lagrange  multi¬ 
pliers.  We  now  consider  the  steps  in  more  detail. 

Information  Function  Selection .  The  form  of  the 
provided  or  calculated  information,  i.e.  ,  the  forms  of  the 
information  functions,  and  the  amount  of  information,  i.e., 
the  number  of  information  functions,  determine  the  form  of 
the  resulting  entropy  density  as  shown  in  equation  3.1. 
Clearly,  specifying  the  wrong  information  functions  or  too 
little  information  may  lead  to  an  unacceptable  approxima¬ 
tion.  Moments  (Ref  90)  and  orthogonal  polynomials  (Ref  15) 
are  examples  of  possible  information  functions.  Our  pro¬ 
cedure  allows  great  flexibility  in  definition  of  informa¬ 
tion  functions. 

A  two-phased  approach  is  used  in  specifying  the 
information  functions  that  best  approximate  a  particular 
continuous,  univariate  density  on  [a,b].  The  first  phase 
includes  specifying  a  large,  general  class  of  "potential" 
information  functions  that  have  particular  conceptual  or 
theoretic  value.  For  example,  all  moments  of  a  random  vari¬ 
able  provide  an  extensive  amount  of  information  and  would 
comprise  a  feasible  potential  set.  For  reasons  indicated 
in  Chapter  II  and  expanded  in  Chapter  V,  moments  do  not  pro¬ 
vide  a  practical  set  of  functions.  A  more  useful  potential 
set  is  discussed  in  Chapter  V.  The  potential  set  should 
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be  large  to  allow  consideration  of  a  wide  range  of  useful 
functions.  Use  of  the  entire  potential  set  to  specify  the 
entropy  density,  equation  3.1,  would  lead  to  a  numerically 
intractable  problem  in  solving  for  the  Lagrange  multiplier 
(just  as  using  all  moments)  and  a  conceptually  dissatisfy¬ 
ing  form  for  the  approximation,  p(x).  In  fact,  much  of  the 
information  may  be  redundant  or  unneeded  when  approximating 
a  particular  density.  Thus,  in  phase  two,  we  seek  the 
minimum  subset  of  the  large  potential  set  that  will  accept¬ 
ably  approximate  the  density  of  interest.  The  minimum  sub¬ 
set  will  be  called  the  "active  set"  of  information  func¬ 
tions.  Thus,  phase  one  is  definition  of  a  large  class  of 
potential  functions  that  will  serve  in  a  wide  variety  of 
characterization  or  approximation  problems,  while  phase  two 
is  selection  of  the  active  set  of  information  functions  that 
pertain  to  a  specific  approximation  or  characterization 
problem. 

A  point  of  clarification  is  needed.  We  will  fre¬ 
quently  interchange  the  terms  characterization  and  approxi¬ 
mation  when  referring  to  the  entropy  procedure.  As  we  will 
see  in  Chapter  V  and  subsequent  chapters,  the  entropy  pro¬ 
cedure  will  exactly  characterize  the  unknown  density  (given 
computational  accuracy)  if  the  potential  information  func¬ 
tion  set  contains  the  correct  functions.  Chapter  II  pro¬ 
vided  such  an  example  for  the  normal  distribution  with  func¬ 
tions  x  end  -'-2  ,  If  the  correct  functions  are  not  present, 
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then  the  procedure  provides  an  approximation  to  the  unknown 
distribution.  Thus,  assuming  a  broad  potential  set,  inter¬ 
change  of  the  two  words  is  permissible. 

Generation  of  Expected  or  Average  Values.  Our 
entropy  approximation  procedure  requires  that  the  informa¬ 
tion  be  given  in  terms  of  expected  values  of  the  informa¬ 
tion  functions,  <g^(x)>,  i=l,2,...K,  or  average  value 
approximations  to  the  expected  values.  The  method  used  to 
obtain  the  <g^(x)>,  i=l,2,...K,  is  transparent  to  the  char¬ 
acterization  procedure;  in  fact,  alternate  methods  exist. 

For  example,  the  analyst  may  possess  information  about  the 
unknown  distribution  which  was  accumulated  through  years 
of  experience  or  repeated  trials.  If  this  information  is 
available  ir.  the  form  of  average  values,  then  the  entropy 
method  may  be  applied  directly.  For  a  more  standard 
approach,  we  assume  that  a  random  sample  of  size  N  is  avail¬ 
able  for  random  variable  X,  i.e.,  x^,  j=l,2,...N.  The 
expected  values  are  then  approximated  by  average  values: 

N 

<g. (x) >  -  E  g .  (x  .) /N ,  i=l,2,...K  (3.2) 

1  j=l  3 

The  accuracy  of  the  approximation  in  equation  3.2  is  depen¬ 
dent  on  the  sample  size  N.  A  third  method  which  is  heavily 
used  in  this  research  is  numerical  quadrature. 
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Numerical  quadrature,  or  numerical  integration, 
provides  an  effective  means  of  computer  integration.  Quadra¬ 
ture  plays  a  key  role  in  application  of  our  characterization 
procedure  to  computer  simulation,  is  a  required  tool  for  the 
numerical  scheme  which  we  employ  to  find  the  Lagrange  multi¬ 
pliers,  and  can  be  used  to  calculate  expected  values.  A 
detailed  discussion  of  one  quadrature  form,  Gauss-Legendre 
quadrature,  may  be  found  in  Appendix  A.  The  general  quadra¬ 
ture  form  follows: 


/  k  q ( x )  dx 
a 


b-a 

2 


m 

j=l 


W_.  q  (x  _ ) 


(3.3) 


where  aoc^oc 
Appendix  A. 


_<...<x  <b  and  the  W.  and  x  are  defined 
2-  -  m-  j  j 

Now  consider  the  expected  value  equation 


in 


<g.  (x)  >  =  /  b  q  .  (x)  f  (x)  dx 

1  CL  '  1 


(3.4) 


where  f (x)  represents  the  unknown  density  that  we  wish  to 
approximate.  If  the  values  of  the  unknown  density  can  be 
approximated  at  the  points  x^,  j=l,2,...m,  then 


h  ••  cl  ^ 

<g  .  (x)  >  ~  — —  F  Wj  g^  (x^)  f(x^) 


The  values  j=l,2,...m,  might  be  reasonably  approxi¬ 

mated  by  a  frequency  table  or  even  numerical  differentia¬ 
tion  of  a  sample  cumulative  distribution. 
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A  very  important  application  of  our  characteriza¬ 
tion  method  is  to  computer  simulation.  The  key  role  of 
quadrature  in  this  application  is  introduced  at  this  point 
for  consistency  and  is  expanded  in  Chapter  IX.  Consider 
the  simplified  simulation  model: 

h(y)  on  [a,b] - ►  F  (y)  — ►  f  (x)  on  [c,d] 

where  h(y)  is  the  known  probability  density  function  of 
input  random  variable  Y  on  [a,b],F(y)  is  a  mathematical 
transformation  representing  the  simulation,  and  f (x)  is  the 
unknown  probability  density  function  for  random  variable  X 
that  we  wish  to  approximate.  We  apply  basic  transformation 
of  variables  techniques  (Ref  39:127)  to  equation  3.4  to 
obtain  the  following: 

<g^  (x)  >  =  /cd  g.^  (x)  f  (x)  dx  =  (F  (y)  )  h  (y)  dy  ; 

and  applying  equation  (3.3) 

h  m 

<g  (x)>  =  SZS.  I  W  g  (F  (y  .)  )  h(y .)  (3.5) 

1  ^  j=1  3  i  3  3 

i=  1 , 2 , . . .K. 

Thus  we  may  calculate  the  expected  values  of  information 
functions  for  a  computer  simulation  by  sampling  from  the 
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simulation  at  m  predefined  points.  The  benefits  of  this 
result  are  pursued  in  Chapter  IX. 

Laaranae  Multipliers.  Once  the  active  set  of  infor 
mation  functions  has  been  selected  and  the  expected  values 
have  been  calculated  or  approximated,  the  constraint  equa¬ 
tions  must  be  solved  to  find  the  K+l  Lagrange  multipliers, 
i=0,l,...K,  where  K  is  the  number  of  functions  in  the 
active  set.  The  simultaneous  solution  of  K+l  nonlinear 
equations  is  a  difficult  task  analytically  and  numerically. 
Acton,  Fox,  Luenberger,  and  Saaty  and  Bran  (Refs  2;  27; 

7 7 ;  67)  describe  various  numerical  approaches.  The  method 
of  choice  in  this  paper  is  the  Newton  method.  A  computer 
program  to  solve  for  the  K+]  lambdas,  given  K  expected  value 
and  the  forms  of  the  information  functions,  using  the 
Newton  method  has  been  implemented.  See  reference  3  for  a 
similar  approach  to  the  discrete  density  problem.  Exist¬ 
ence  and  uniqueness  properties,  and  a  numerical  scheme  for 
general  solution  of  the  constraints  are  discussed  in  the 
next  chapter. 

Resulting  Density  Function .  The  form  of  the  maxi¬ 
mum  entropy  approximation  is  known:  p(x)  =  exp [ - ^ g^ (x) 

-  .  .  ,  \j.g  (x)  ]  .  The  specific  entropy  density,  p(x),  that 
will  approximate  or  characterize  the  unknown  density,  f (x) , 
is  selected  through  anpljeation  of  the  above  procedural 
steps.  We  summarize  the  procedure.  The  active  information 


function  set  is  selected  to  include  the  form  and  number  of 


information  functions.  Average  or  expected  values  of  the 
information  functions  are  generated.  Finally,  the  Lagrange 
multipliers  are  calculated  via  the  Newton  method,  and  p(x) 
is  completely  defined. 

Application  of  this  method  has  produced  excellent 
approximations  in  numerous  test  cases.  The  specifics  of 
the  above  procedure,  to  include  test  examples  and  applica¬ 
tions,  are  discussed  in  the  following  chapters. 
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Chapter  IV.  Solution  of  the  Constra int  Equations 
Introduction 

The  theoretical  backbone  of  our  entropy  characterization 
method  is  presented  in  the  first  section  of  this  chapter. 
The  entropy  approach  is  based  on  the  solution  of  a  con¬ 
strained  optimization  problem.  We  present  the  probleip  and 
derive  Theorem  4.1  which  defines  the  form  of  the  solution 
density , 


P ( x )  =  exp[-AQ, A1g1 (x) -. . .Xkgk(x) ]  (4.1) 

We  then  addres=  solution  of  the  constraint  equations  for 

T 

the  lambda  vector ,  A= ( Aq , A 1 . . . Ak>  ,  which  equates  to  selec¬ 
tion  of  a  particular  density  from  the  family  represented  in 
equation  4.1.  Two  theorems  pertaining  to  uniqueness  pf 
solution  are  presented.  Theorem  4.2  shows  that,  given 
existence,  there  is  only  one  solution  vector  that  maximizes 
the  entropy.  However,  iterative  solution  of  the  constraints 
may  lead  to  a  local  optimum  as  shown  in  Theorem  4.3.  The 
first  section  concludes  with  Theorem  4.4  which  shows  that 
the  average  values  of  our  information  functions  are  complete 
sufficient  statistics  for  selection  of  a  specific  p(x). 

The  second  section  of  the  chapter  presents  a  numerical 
scheme  to  apply  the  theory.  Performance  of  the  scheme  and 
numerical  sensitivities  are  discussed. 
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Theoretical  Development 

Form  of  the  Solution .  We  assume  that  the  forms 
and  number  of  "active"  information  functions  are  known,  and 
that  expected  values  are  given;  thus,  g^  (x)  and  <g^(x)>, 
j-l,2,...k,  are  known.  We  wish  to  find  the  continuous  den¬ 
sity,  p(x) ,  on  [a,b]  that  will  satisfy  the  given  informa¬ 
tion,  i.e.,  produce  the  given  expected  values,  while  main¬ 
taining  maximum  uncertainty  with  respect  to  other, 
unspecified  information.  The  mathematical  statement  of 
this  problem  is  repeated  from  equations  2.6: 

max  S  (p  (x )  )  =  max  (-/_bp(x)  lnp(x)  dx) 

d 

subject  to 

A 

'b  p  ( x )  dx  -  1  , 

d 

/b  g  _( x )  p(x)  dx  =  '  g  .  (x)  >  ,  j=l,2,...k  (4.2) 

a  1  ] 

Ke  assume  that  the  g^  (x) ,  j=l,2,...k,  are  continuous  and 
bounded  on  [a,b|.  In  terms  of  a  probability  space,  we  con¬ 
sider  probability  space  (X ,  L ,  U)  where  X  is  the  interval 
[a,b],  L  is  the  sigma  algebra  of  Lebesgue  measurable  sets 
on  X,  and  the  probability  measure  U  on  L  is  defined  by  the 
probability  density  function 

p ( x ) _0 ,  a  x<b,  / b  p ( x )  dx  = 1  (4.3) 

cl 

Wo  apply  the  Lagrange  method  of  undetermined  coefficients 
to  equation.'-  4.2  to  find  the  density  in  4.3.  The 
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Lagrangian,  L(p(x),A),  follows: 

k 

L(p(x)  ,A)  =  S(p(x))  -  X- (/p(x)  dx-1)  -  I  X  (/g  (x)p(x)  dx-<g . (x)>  ) 

u  j=l  3  3  3 


k  k 

=  /p(x)  fln(l/p(x)  )-Xn-  Z  x.g.  (x)]dx+  X  +  Z  X.<g.(x)> 

j=l  3  3  u  j=l  3  3 


k  k 

=  /p(x)  {In  (  (1/p  (x) )  exp  (-X  -  Z  X.g  (x)) )  }dx  +  X  +  Z  X  <g  (x)  > 

u  j=l  3  3  u  j=l  3  3 


We  apply  the  knowledge  that  for  all  x>0 


In 

(x)  <  x-1 

if 

X*1 

and 

In 

(X)  =  x-1 

if 

X  =  1 

to  get 

k  k 

L(p(x)  ,A)</p(x)  [  (1/p  (x) )  exp  (-Xn-  Z  X.g.  (x))  -  1]  dx  +  X  +  Z  X  ,<g .  (x)  > 

j=l  3  3  j=l  3  3 


Since  we  want  to  maximize  L(p(x),A),  we  seek  equality  in 
our  last  expression  which  occurs  if  and  only  if  p(x)  = 


exp[-X  -  Z  X.g.(x)],  The  preceding  result  is  well  known 
U  j=l  3  3 

and  is  mentioned  in  several  references  without  proof  for  the 
continuous  case  (Ref  42;  82;  89).  This  derivation  is  given 
to  enhance  clarity.  The  derivation  is  a  generalization  of 
work  presented  by  Guiasu  (Ref  33:298-301)  and  attributed  to 
Kampe  de  F^riet  (Ref  48)  and  Ingarden  and  Kossakowski 
(Ref  41).  The  Guiasu  presentation  was  concerned  with  the 
normal  and  Poisson  distributions.  We  summarize  with  a 
theorem. 
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Theorem  4.1.  Given  probability  space  (X ,  L,  U) 
where  X  is  the  interval  [a,b],  L  is  a  sigma  algebra  of 
Lebesgue  measurable  sets,  and  U  is  defined  in  terms  of  p(x) 
as  in  equation  4.3,  then  the  density  function,  p{x),  which 
maximizes  the  entropy  subject  to  constraints  as  represented 
in  equations  4.2  is  of  the  form 

p(x)  =  exp[->o~>1 g1(x) . .-  >k  gk  (x) ] 


..’hero 


the 


o  .  (x) 

3 


are  continuous, 


bounded  functions  on  X. 


Theorem  4.1  provides  the  form  of  the  entropy  char¬ 
acterization  density.  Given  a  specific  set  of  expected 

values,  <a_j(x)>,  j=l,2,...k,  we  solve  the  k  +  1  constraints 

T 

for  /.  --  ( '■  o  ' '' i '  •  •  •  to  completely  determine  p(x).  We 

now  relate  concepts  from  Tribus  (Ref  82) ,  Guiasu  (Ref  33) , 
and  Kullback  (Ref  51)  with  the  special  properties  of  our 
problem  to  discuss  existence  and  uniqueness  properties. 


Existence .  The  existence  of  a  solution  to  the  con¬ 
straints  is  not  guaranteed.  We  may  clearly  specify  a  set 
of  expected  values  which  form  an  inconsistent  set  of  con- 
staints  and  for  which  no  density  exists.  For  example,  con¬ 
sider  k  =  2,  g, (x)=x,  g2(x)=x2,  <g1(x)>=10,  and  <g2<x)>=99. 

As  discussed  in  Chapter  II  of  this  paper,  the  maximum 
entropy  density  given  only  <x>  and  <xJ>  is  the  normal  den¬ 
sity.  Consider  the  variance  of  the  density  we  have  speci- 
f  i  r.  : 
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Var  =  < (x-<x>) 2 >  =  <x2 >  -  <x> 2 


99  -  100 


or  Var  =  -1.0 

This  example  thus  asks  for  a  normal  density  with  negative 
variance  which  is  not  possible.  Collins  and  Wragg  (Ref  14) 
state  that,  in  the  general  case,  the  precise  conditions  on 
the  expected  values  of  moments  for  which  a  A  vector  will 
exist,  where  the  A  vector  satisfies  the  constraints,  do 
not  seem  to  be  known.  Only  for  information  functions 
g^(x)=x  (k=l)  and  g^(x)=x,  g2(x)=x2  (k=2)  are  conditions 

known  in  any  completeness.  Wragg  and  Dowson  (Ref  90), 
Widder  (Ref  88),  and  Ahiezer  and  Krein  (Ref  4)  provide 
extended  discussion  of  conditions  for  valid  moment  sequen¬ 
ces  . 

We  are,  however,  concerned  w’ith  practical  applica¬ 
tion  and  our  problem  is  somewhat  restricted.  Our  problem 
centers  on  approximating  an  "existing,"  though  unknown, 
density.  Samples  from  the  unknown  density  are  used  to 
approximate  the  expected  values  via  quadrature  or  sample 
averages.  Thus,  we  have  a  consistent  set  of  constraints 
provided  that  the  expected  value  approximations  are  accu¬ 
rate.  Inconsistencies  that  result  from  sampling  or  compu¬ 
tational  errors  may  be  alleviated  by  increasing  sample 
size,  or  increasing  the  number  of  quadrature  points,  and 
including  computational  checks  to  produce  more  accurate 
expected  value  approximations.  We  thus  assume  a  consistent 
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set  of  constraints,  i.e.,  the  unknown  density,  f(x),  satis¬ 
fies  the  constraints.  Assuming  an  intelligent  choice  of 
information  functions,  as  discussed  in  later  chapters,  we 
will  produce  a  p(x).  Theorem  4.1,  that  acceptably  approxi¬ 
mates  f (x) .  In  this  manner,  the  existence  problem  is  con¬ 
ceptually  translated  to  a  problem  of  specifying  the  correct 
information  functions.  For  purposes  of  this  paper,  we 
assume  existence  of  a  solution  vector,  A  =  (X q , X i , . . . X. )  , 

for  a  given  expected  (or  average)  value  vector,  <G>  = 
(l,<g^(x)>,<g^(x)>,...<g^(x)>)^'. 

Uniqueness .  We  wish  to  discuss  uniqueness  in  two 
respects.  First  we  show  that  if  there  exists  a  second  den¬ 
sity  on  fa,b1 ,  p ( x ) ,  where  p(x)  may  take  any  form  such  that  3 

p(xl_0  and  p(x)0  almost  everywhere  (a.e.)  on  [a,b],  p(x) 
satisfies  the  constraints,  and  p(x)  maximizes  the  entropy, 
then  p(x)=p(x)  a.e.  Secondly,  we  show  that  the  solution, 

A,  which  maximizes  the  entropy  is  a  global  solution,  i.e., 
if  there  exists  a  P  -  (8^,8^,...  By)  such  that 

p(x)  =  exp  [-Bg-B^  g^  (x)  - .  . .  8^  gk  (x)  ] 

where  p(x)  satisfies  the  constraints  and  maximizes  the 
entropy ,  then  8=A .  We  will  combine  these  results  in  one 
theorem.  The  approach  is  to  assume  a  second  solution, 
pfx)  ,  of  any  form  such  that  p(x)2.0  for  a]l  x  in  [a,b]  and 
f:  *  •• )  '0  a.e.  on  (a  ,H  .  We  then  consider  the  entropies  of 
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our  two  solutions,  F=S (p (x) ) -S (p (x) ) ,  and  show  F^.0  a.e.  and 
F=0  if  and  only  if  p(x)=p(x)  a.e.  We  then  assume  the  exist¬ 
ence  of  a  8  vector  and  obtain  our  second  result.  The 
theorem  and  proof  are  motivated  in  a  proof  by  Tribus  of  a  dis¬ 
crete  maximum  (Ref  82:123)  and  a  theorem  on  information 
discrimination  by  Kullback  (Ref  51:14). 

Prior  to  a  statement  of  our  theorem,  we  discuss  the 
Kullback  theorem.  The  Kullback-Leibler  information  dis¬ 
crimination  measure  was  introduced  in  the  background  chap¬ 
ter  of  this  report  (equation  2.4).  In  keeping  with  Kull¬ 
back,  we  use  the  probability  spaces  of  Chapter  II,  (X ,  L , 

U^)  ,  i=l,2,  and  define  a  third  probability  measure,  U.  Let 
U  be  absolutely  continuous  with  respect  to  (w.r.t.)  tL  and 
IL  be  absolutely  continuous  w.r.t.  U,  i=l,2;  for  example, 

U  may  be  U^,  or  U2 ,  or  (U^+U2> /2 .  The  Radon-N ikodym 
derivatives  are  now  defined  in  terms  of  U  as  4^  (x) =dlL  (x) / 
dU  (x)  where  <?^(x),  i=l,2,  are  functions,  unique  up  to  sets 
of  measure  (probability)  zero  in  U,  0<4^  (x)<“  such  that 
U  .  (E)  =/_4> .  (x)  dU  (x)  ,  i=l,2,  E  an  element  of  L.  Using  this 

1  b  1 

nomenclature,  we  rewrite  the  discrimination  measure  of 
Chapter  II,  Id^rU^),  to  an  equivalent  form: 

$(x)  <J)  ( x ) 

I(U2,U1)  =  /In  ^(x)  dU2(x)  =  /4~2  (x)  In  ^-{xy  dU(x) 

(4.4) 
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Kullback  Theorem.  Id^iU^)  is  almost  positive 
definite;  that  is,  Id^U^^O,  with  equality  if  and  only  if 
<* ^  (x)  =0 2  (x)  a.e.  w.r.t.  U.  See  Kullback  (Ref  51:14)  or 
Guiasu  (Ref  33:22)  for  proof  of  this  theorem. 

Armed  with  the  Kullback  Theorem,  we  turn  to  the 
question  of  uniqueness  of  p(x)  and  uniqueness  of  the 
Lagrange  multiplier  vector,  A.  We  have  p(x)  as  shown  in 
equation  4.1  and  k+1  constraints: 

/ k  a  .  ( x )  p(x)  dx  =  <g.(x)>,  j=0,l,...k 
a  j  J 

where  gQ(x)El.  For  a  specified  set  of  expected  values, 

T 

<G>q= (1 , <g^ (x) > , . . . <gk (x) >)  Q  ,  we  solve  the  constraints 

T 

for  Aq=  ( A 0 , . . . X^)  0  to  completely  determine  p(x).  From 
Theorem  4 . 1  we  know  that  p(x)  maximizes  the  entropy, 
S(p(x)).  Now  assume  that  there  exists  density  p(x)  that 
satisfies  the  constraints  where  p(x)  may  take  "any  form" 
subject  to  two  conditions:  p(x)_>0»  x  in  [a,b]  and  p(x)>0 
a.e.  on  [a,b].  The  second  condition  is  needed  to  insure 
that  the  probability  measures  associated  with  p(x)  and 
£>(x)  are  absolutely  continuous  w.r.t.  each  other.  We  wish 
to  determine  if  p(x)  is  also  of  maximum  entropy.  We  may 
represent  our  state  of  knowledge  with  two  sets  of  equa¬ 
tions  . 


t- 
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s  =  -f”  p(x)  lnp(x)  dx, 


with 


S  *  p(x)  lnp(x)  dx 


with 


/_  gs(x)  p(x)  dx  =  <g.(x)>, 
a  J  j 


j=0, 1 , . .  .k 


and 


p(x)  =  exp  [-  Z  g  .  (x)  X  .] 
j=0  3  3 


/,  gs  (x)  p(x)  dx  =  <g.(x)>, 
a  J  3 


j=0,l,  . . .k 


and 


p  (x) >0,  xc  [a ,b]  , 
p(x)~0,  a.e.  on  [a,b]. 


All  integrals  in  the  following  derivation  are  over  the 
interval  [a,b]  although  the  limits  of  integration  will  not 
be  shown.  Consider, 

F  =  S  -  S  =  /p(x)  lnp(x)  dx  -  /p(x)  lnp(x)  dx 
Now  add  and  subtract  /p(x)  lnp(x)  dx, 


F  =  /p  ( x )  In  ( p  ( x  )/p  ( x ) )  dx  +  /(p(x)-p(x))  In  p  (x)  dx 


We  substitute  for  the  known  form  of  p(x)  in  the  last 
integral : 

k 

F  =  / p(x)  In  (p(x)/p(x))  dx  +  /(p(x)-p(x))  Z  X  .g  .  (x)  dx 

j=0  3  3 


k 

=  / p(x)  In  (p{x)/p(x))  dx  +  Z  X  .  {/g  .  (x)p(x)  dx - 

j=0  3  3 


Sg j  (x)  p (x)  dx} 

Clearly,  the  last  (k+1)  terms  cancel  due  to  the  require¬ 
ment  of  constraint  satisfaction  and  thus, 


35 


F  =  S  -  S  =  / p(x)  In  (p(x)/p(x))  dx 


(4.5) 


Since  p(x)  and  p(x)  are  probability  density  functions  on 
[a,b] ,  we  may  define, 

P(x)  =  prob  (X^x)  =  /  x  p(y)  dy  and 

P{x)  =  prob  (X<x)  =  /  x  p  ( y )  dy,  with 

a 

P (x)  =  P (x)  =  0  if  x<a,  and 
P (x)  =  P (x)  =  1  if  x>b, 

as  probability  measures  on  [ a , b ] .  We  know  p(x)  and  p(x)>0 
for  all  x  in  [a,b] ,  except  for  a  set  of  measure  zero,  and 
by  definition  p(x)=p(x)=0  for  all  x  not  in  [a,b].  Thus 
P (x)  is  absolutely  continuous  w.r.t.  P (x)  and  vice  versa. 

We  are  now  in  a  position  to  apply  the  Kullback 
Theorem.  With  the  Kullback  nomenclature,  we  let 

U^(x)  =  U(x)  =  P(x);  (x)  =  P(x); 

^(x)  =  dP(x)/dP(x);  (x>  =  dP(x)/dP(x); 

I(U2,U1)  =  /$2(x)  In  ($2 (x) /i^  (x) ) dU (x) 

=  /  (dP  (x)  /dP  (x)  )  In  (dP(x)  /dP(x))dP(x) 

I(u-,U  '  =  /p (x)  In  (p (x) /p (x) )  dx,  (4.6) 

We  equate  equations  4.5  and  4.6  to  obtain 

F  =  S  -  S  =  I (U^,U. ) . 

/.  1 
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Thus  S-S^O  with  S-S=0  if  and  only  if  (x) =  $2  (x)  a.e.  w.r.t. 

U,  or  dP  (x) /dP  (x)  =dP  (x) /dP  (x)  a.e.  w.r.t.  P(x)  which  implies 
that  p(x)=p(x)  a.e.  w.r.t.  P (x) .  Thus  we  obtain  the  impor¬ 
tant  result  that  either  p(x)  is  the  only  form  of  solution 
that  maximizes  the  entropy  or  any  other  solution,  p(x), 
must  satisfy  p(x)=p(x)  a.e. 

We  take  this  development  one  step  further  by  assum¬ 
ing  the  existence  of  a  Lagrange  multiplier  vector  such 

k  u 

that  p(x)=exp[-  I  8.  g. (x) ] ,  p(x)  satisfies  the  constraints, 
j=0  1  3 

and  p(x)  has  maximum  entropy.  Thus  S-S=0  and  p(x)=p(x) 
a.e.  which  implies 

lnp(x)  =  lnp(x)  a.e.,  and 

k  k 

-  I  L  g.(x)  =  -  I  B.g.(x)a.e.,  or 

j-Q  J  J  j-Q  J  J 


k 

l  ( 6 . — X  )  g  (x)  =  0  a.e. 
j=0  ->  3  ■> 


If  the  g^ (x)  are  linearly  independent  functions,  then  the 
only  linear  combination  of  g^(x)  that  equals  zero  a.e.  is 
if  (Bj-Xj)=0  for  all  j.  Thus  B^  =  Xj  f°r  j=0,l,...k.  We 
summarize  the  above  developments  in  a  theorem. 


Theorem  4.2.  Let  there  exist  A^= ( Xq , , . . . X^)  Q 
k+i  k 

with  An  an  element  of  R  such  that  p(x)=exp[-  l  X.g.(x)], 
0  j=0  -* 
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g  (x)=l,  and  p(x)  satisfies  the  constraints  /  g.(x)p(x)  dx 

o  a  j 

=<g.{x)>,  j=0,l,...k,  with  S  (p  (x)  )  =-/ ^  p  (x)  lnp(x)  dx.  If 
j  a 

there  exists  a  function  p(x)  such  that  p(xijM)  for  all  x  in 

[a,b]  and  p(x)>0  a.e.  on  [a,b],  then  S (p (x) ) (p (x) )  and 

S (p (x) ) =S (p (x) )  if  and  only  if  p(x)=p(x)  a.e.  on  [a.b.] 

k  +  l  ^ 

If  there  exists  a  6  in  R  such  that  p(>:)=exp[-I  6.g.(x)), 

j  =  0  ^  ^ 

with  linearly  independent  g^  (x) ,  and  p(x)  satisfies  the 
constraints,  then  S  (p  (x) )  =S (p (x) )  if  and  only  if  Xj=S., 

J  ~  0 , 1  ,  .  .  .  k  . 

We  have  established  the  form  and  uniqueness  of 
solution  for  our  optimisation  problem  given  that  a  solution 
exists.  We  now  directly  explore  the  constrain*-  equations 
and  the  possibility  of  "local"  optimum  solutions.  Only 
one  A=Aq  exists  for  which  p(x)  satisfies  the  constraints 
and  S(p(xl)  is  maximum;  however,  there  may  exist  a  A  =  8 
with  corresponding  p(x)  which  satisfies  the  constraints, 
and  where  S  (p  (x) )  <  S  (p  (x) )  .  If  there  exists  a  neighborhood 
of  p(x) ,  where  p(x)  is  not  an  element  of  the  neighborhood, 
such  that  S  (p  (x)  )  _^S  (q  (x) )  for  all  q(x)  in  the  neighborhood 
then  p(x)  is  a  local  optimum.  Local  optimums  are  of  con¬ 
cern  because  the  k+1  nonlinear  constraint  equations  must 
oe  solved  with  iterative  numerical  techniques,  and  such 
techniaues  may  converge  to  local  optimums. 

Given  that  AQ  is  a  solution  vector  for  the  non¬ 
linear  system  of  equations  F(A)=<G>  where 


p(x)  =  expl-X0g0(x)-A1g1(x)-.  .  .Xkgk(x)  ]  ,  gQ(x)  =  l,  and  all 
integrals  are  over  the  interval  [a,b] ,  then  the  following 
theorem  addresses  solution  uniqueness. 


Theorem  4.3.  Let  gQ(x),  g^  (x)  ,  .  .  . gk  (x)  ,  finite  k, 
be  continuous  functions  in  L2  [a,b]  and  g^ (x)  g^ (x)  be  in 
L2  [a ,b]  for  i,j=l,...k,  gg(x)5i.  if  A  is  a  solution  of 
F(A)=<G>  for  a  specific  <G>g,  and  if  the  g^ (x) ,  j=0,l,...k, 
are  linearly  independent  functions,  then  there  exists  a 
neighborhood,  W ,  of  AQ  where  AQ  is  the  unique  solution 
of  F(A)=<G>0. 

Proof.  Consider  F ( A )  to  be  a  function  from  some 
k+1  k+1 

subset  of  R  to  R  .By  the  fundamental  Inverse  Func¬ 
tion  Theorem  (Ref  73:354),  if  F (A )  is  continuously  differ¬ 
entiable  in  some  neighborhood  of  Ag  and  if  the  linear  trans¬ 
formation  F'(Aq)  is  invertible  (nonsingular),  then  there 
exists  neighborhoods  W  and  V  of  Ag  and  F(Ag),  respectively, 
such  that  F:  W-*V  is  a  one-to-one,  onto  mapping.  Thus, 
given  solution  vector  Ag,  for  all  Befc',  F(6)=<G>q  if  and 
only  if  8=Ag,  or  AQ  is  the  unique  solution  vector  in 
neighborhood  W.  We  will  show  that  F  is  continuously 
differentiable  in  some  neighborhood  of  Ag  and  that  F'(Ag) 
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is  nonsingular  in  that  neighborhood,  and  the  proof  will  be 
complete . 

1.  Continuous  differentiability.  Continuous  first 
partial  derivatives  are  necessary  and  sufficient  for  con¬ 
tinuous  differentiability.  Thus  we  consider 

J=  PVA>  /9V  Sf0<A)/aAl'  •••  3fO(A,/3Xk 

L5;ic<a,",V  Sfk|,l/S,r  «»«*»/»\ 

where  ?f  ^  (/.) /I  >  ^  =  /g^  (x)  g^  (x)  p  (x)  dx ;  i , i=0 , 1 , . . . k . 

Clearly  p(x)  is  a  composite  of  the  continuous  g  (x)  and  is 
continuous.  Integration  is  a  continuous  operation  and  the 
products  of  continuous  functions  are  continuous.  Thus, 
each  element  of  J  is  continuous  and  F ( A )  is  continuously 
■  d f  ferent  table  . 

2.  Nonsingularity  of  F'(A)=J.  J  is  nonsingular 

at  .aq  if  the  determinant  of  J,jj|,  at  is  not  zero.  We 
claim  that  !cj^0  if  and  only  if  the  (x) ,  j=0,l,...k  are 
linearly  independent.  We  prove  the  contrapositive  of  this 
claim,  i.e.,  iJ'=0  if  and  only  if  the  g^ (x)  are  linearly 

dependent . 

( -> )  Assume  the  g ^ (x)  ,  j=0,l,...k  are  linearly 

leper  :  •  nt  and  show  ' J i =0.  Based  on  this  assumption,  there 

k 

-  constants,  a.,  not  all  zero,  such  that  I  a-g.(x)=0 
1  j=0  3  3 

for  ell  x  j.r.  |  n ,  b  1  .  We  r~uy  write  one  (x)  as  a  linear 
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combination  of  tne  others  and  rearrange  the  indexing  such 
that 


k-1  k-1 

g.  (x)  =  -  E  (a  ./a.  )  g. (x)  =  -  I  (x) 

K  j=0  3  X  3  j=0  3  3 

Now  consider  the  m*"*"1  row  of  J,  i.e.. 


[ ~/gm (x)  gQ(x)p(x)  dx, . . . ,~/gm(x)  gk(x)p(x)  dx]  . 

We  substitute  for  the  kfc^  element  of  this  mth  row  (i.<-., 
the  kth  column  entry) : 


k-1 

-/ gm  (x)  g.  (x)  p  (x)  dx  =  -/ g  (x)  (-1  6^g^  ( x )  )  p  (x)  dx 
m  x  m  j  _  q  J  J 

k-1 

=  l  Bs/g_lx)  g  .  (x)  p(x)  dx 
j=0  3  m  3 


The  last  summation  equates  to  a  linear  combination  of  the 
other  (k-1)  columns.  This  procedure  holds  for  all  rows. 
Thus,  the  kth  column  is  written  as  a  linear  combination  of 
the  first  (k-1)  columns  and  |J|=0. 

(b)  Assume  |j|=0  and  show  that  the  g^ (x)  must 
be  linearly  dependent.  Since  |j|=0,  then  the  columns  (or 


rows)  of  J  are  linearly  dependent.  Thus  we  have  constants, 

k 

a.,  not  all  zero,  such  that  -  1  a./g  (x)g.p(x)  dx=0.  It 
3  k  j  =  0  3  111  3 

follows  that  -/ g  (x)  (  I  cx.gdx))  p(x)dx=0  for  all  m  (i.e., 
m  j  =  0  3^ 

for  all  rows)  and  hence 
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k  k 

-  T  a_/g_  ^  aSg.,(x))p(x)dX  =  0  ,  Or 

m=0  m  m  j=0  3  3 


k  k 

-f  (  -  a  q  (*)  )  (  *-  asgj(x))  p  (x)  dx  =  0  and 
m=0  m  m  j  =  0  3  3 


k 

-/  (  Z  0.  g  (x)  )  2  p  (x)  dx  =  0  . 

j  =  0  3  3 


We  know  that  p(x)=exp[-' 

t.hus  ptx)  0  for  all  x  in 
k 

(  1  u  . q .  (x ) i  ' =0  a .e.  for 
1  =  0  3  3 
k 


q9q  (x)  -X  g  (x)  -  .  .  .  -1.  c.  {>. )  ]  and 
[a,b]  where  a  -“  .  Thus 
the  last  equation  to  hold  or 


i  n  .  a  .  ( x )  =  0  a 
j  =  0  33 

of  the  <~  .  ( x '  . 

J 

are  considered 
66:112) .  Thus 


e.  which  is  a  statement  of  linear  dependence 
We  remember  that  two  function:,  in  L"  [a,b] 
equivalent  if  they  are  equal  a.e.  (Ref 
ij|=0  implies  linear  dependence  of  the 


a.,  (x)  . 

We  have  shown  that  J  is  nonsingular  and  the  condi¬ 
tions  for  application  of  the  Inverse  Function  Theorem  are 
satisfied.  The  proof  of  Theorem  4.3  is  complete. 


Suf f ic ient  Statistics.  Our  final  theorem  concerns 
the  concepts  of  complete,  sufficient  statistics  for  solu¬ 
tion  of  the  vector.  The  theorem  shows  us  that  the 
aveiage  values  of  our  information  functions  (which  approxi¬ 
mate  expected  values  in  our  work)  contain  all  the  infor¬ 
mation  ot  the  random  sample  which  was  used  to  generate  the 
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values,  i.e.,  information  is  not  lost.  Further,  the 
average  values  contain  sufficient  information  ior  esti¬ 
mating  vector  A.  The  theorem  is  a  special  case  of  a 
theorem  presented  in  Hogg  and  Craig  (Ref  39:232)  for  the 
"regular  exponential  class”  of  probability  densities.  We 
restate  the  Hogg  and  Craig  results,  in  our  terminology  for 
our  special  case,  as  Theorem  4.4. 

Theorem  4.4.  Let  the  entropy  density  be  of  the  form 
k 

p (x) =exp [-  1  X  .  g .  (x) ]  f or  x  in  [a,b]  and  p  (x) =0  for  all 
i=0  1 

other  x  with  gQ(x)-l  and  linearly  independent  g^x), 
i=0,l,...k.  Given  a  random  sample  (x^ , X£ , . . . xN>  with 
N'-k,  then  the  functions 

N 

V.  =  X  g.  (x  .)  /N,  i  =  0 , 1 ,  .  .  .k, 

1  3=1  1  3 

arc  complete  sufficient  joint  statistics  for  determining 

T 

the  vector  A= ( Xq, , . . . X^)  . 

Section  Summary .  In  this  section  we  have  derived 
the  form  of  the  density  family  which  maximizes  the  entropy 
while  satisfying  given  constraints.  We  have  shown  that 
only  one  density  will  maximize  the  entropy,  if  a  solution 
density  exists,  although  numerical  solution  of  the  con¬ 
straints  may  lead  to  a  local  maximum.  We  related  existence 
to  the  correct  selection  of  information  functions. 

Finally,  we  have  shown  that  the  average  values  of  the 
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information  functions  provide  all  the  information  needed 
to  select  the  one  entropy  density.  That  is,  the  average 
values  comprise  all  available  information  that  can  be  pro¬ 


duced  by  a  random  sample  (x^,x2» . . •xN)  of  the  unknown  den¬ 
sity.  We  now  direct  our  attention  to  application  of  the 
above  theory  in  terms  of  numerical  solution  of  the  con¬ 
straints  . 

Numerical  Solution  Scheme 


Key  tc;  a  General  implementation  of  the  entropy 


x  i  is  .i  1 1  or i 

procedure  is  the 

ability  to  find  the  correct 

r'o 

'  1 '  •  •  •  k 

)  for  a  given  set 

of  expected  values,  i.e., 

.-oluc 

ion  of  th 

c  constraints.  K 

e  restate  the  problem: 

f  1  n  d 

'  ^  P 

0'  1’ 

. ..'k)V  such  that 

4 

V' 

)  =  /  p  ( x )  dx 

-1.0  =  0 

fl<‘ 

)  -  /g  (x  >  p  (x)  dx 

-<  a  (x) >  =  0 

fk’(A 

)  =  /  (  X  )  D  ( x )  dx 

CO 

c 

ii 

A 

X 

* 

where 

p (x) =exp 

[“X0-  )'1gl (X)  -*  ‘  •“ 

Xkgk^x^  <g^(x)>  and 

the  form  of  g. 

(x)  known  for  i=l 

,2,...k.  The  (k+1)  con- 

straints  are  nonlinear  and,  except  for  a  few  restricted 
cases,  cannot  be  solved  directly  for  the  A  vector.  As 
previously  mentioned,  several  authors  discuss  iterative 
numerical  schemes  for  simultaneous  solution  of  a  system 
of  nonl’near  equations  (Refs  2;  27;  56;  67).  For  our 
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approach,  we  write  equations  4.8,  in  vector  notation,  as 
a  fixed  point  problem,  F(A)=0  where  F (A)  =  [f ^ (A) ,f  (  i) , . . . 

.  We  have  implemented  a  computer  program  which  suc¬ 
cessfully  applies  the  Newton-Raphson  successive  approxima¬ 
tion  procedure  to  the  above  fixed  point  problem. 

The  Newton  method  is  based  on  iterative  solution  of 
the  following  equation: 


<VW  =  F<V 


(4.9) 


where  A^  is  the  Lagrange  multiplier  vector,  A,  for  the  ntl1 

iteration  and  J  is  the  Jacobian  matrix  for  F(A  ).  An 

initial  guess,  A^,  is  selected  and  equation  4.9  is  solved 

for  A^.  The  scheme  repeats  for  A2' A3' * ' ,An,An+l'  until 

the  difference  (Aft-A  is  less  than  a  predefined  value, 

i.e.,  until  convergence  occurs.  The  actual  convergence 

criteria  for  our  program  requires  that  the  final  value 

of  each  element  of  the  (A  -A  , . )  vector  be  less  than  a 

n  n+1 

predefined  epsilon.  When  convergence  is  obtained,  Afi  con¬ 
tains  the  solution  values  of  A_. ,  j=0,l,...k.  The  program 
is  written  as  a  subroutine,  subroutine  ENTROP,  which 
requires  the  user  to  specify  the  following  items:  the  num¬ 
ber  of  active  information  functions,  k;  the  approximation 
bounds,  [a,b];  a  vector  to  identify  the  active  information 
functions;  and  a  vector  which  contains  the  average  or 
expected  values  of  the  active  functions.  The  potential  set 
of  information  functions  is  provided  as  a  set  of  numbered 
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external  functions,  i.e.,  F1,F2,...FM.  The  potential  set 
is  thus  easily  modified  without  access  to  the  subroutine. 
The  user  identifies  the  active  set  for  a  particular  problem 
by  specifying  the  respective  function  numbers.  The  pro¬ 
gram  is  currently  implemented  with  twelve  potential  func¬ 
tions  and  a  maximum  of  six  active  functions  (plus  gQ(x)=l). 
Larger  sets  can  be  accommodated  with  simple  program 
changes.  Subroutine  ENTROP  solves  equations  4.9  for  vector 
(An~A  j)  using  matrix  decomposition  and  two  programs  from 
the  International  Mathematical  and  Statistical  Libraries 
(IKSL).  All  integrations  for  production  of  the  Jacobian 
and  F  (A  )  values  are  accomplished  using  a  32  point  Gauss- 
Legendre  quadrature  program  from  a  local  library.  Once 
convergence  is  reached,  the  A  vector  is  returned.  Sub¬ 
routine  ENTROP  has  been  extensively  tested  with  very  posi¬ 
tive  results. 

Convergence  and  rate  of  convergence  of  the  Newton 
method  are  dependent  on  the  initial  guess,  Aq.  Theorems 
exist  which  address  convergence  of  iterative  schemes  in 
general  (Refs  13;  47)  and  the  Newton  method  explicitly 
(Ref  67) .  The  theorems  usually  specify  a  neighborhood 
about  the  solution  wherein  the  scheme  will  converge  if  the 
initial  guess  is  within  that  neighborhood.  Acton  (Ref  2), 
Collatz  (Ref  13),  and  others  present  examples  of  diver¬ 
gence  due  to  poor  initial  guesses.  As  expected,  subroutine 
ENTROP  is  sensitive  to  the  initial  guess  Aq,  and  a  poor 


initial  guess  will  cause  divergence  or  numerous  iterations 
for  convergence.  For  example,  we  consider  the  data  pre¬ 
sented  in  Table  IV. I.  The  data  pertains  to  the  output  of 
a  computer  simulation  which  will  be  discussed  in  Chapter  IX. 
We  wish  to  find  the  A  vector  for  the  four  information  func¬ 
tions,  FI,  F2,  F5,  F8,  where  the  average  values  and  bounds, 
[a,b] ,  are  computed  from  the  data.  Previous  application  of 
subroutine  ENTROP  for  other  combinations  of  information 
functions,  using  the  same  data,  indicated  that  AA  was  a 
feasible  starting  value  for  the  Newton  method.  However, 
the  large  value  of  XAg=374.114  produced  a  terminal  numeri¬ 
cal  error  in  ENTROP.  A  second  attempt  with  initial  guess 
Ag ,  where  ABQ=0.0  and  other  elements  of  Ag  were  equal  to 
A A,  failed  to  converge  in  35  iterations,  although  a  terminal 
error  did  not  result,  A  final  attempt  with  A^= (0, 0, . . . 0) 
converged  in  20  iterations.  Several  schemes  for  intelli¬ 
gent  selection  of  the  initial  vector,  Ag,  were  evaluated 
throughout  the  research.  The  one  scheme  that  converged 
for  every  test  on  [a,b] ,  where  a  solution  existed,  was  the 
initial  vector  of  all  zeroes.  Conceptually,  this  tells  us 
that  the  first  iteration  of  the  Newton  method  produces  a 
A^,  A1=-J (Aq)  _1F  (Aq)  ,  where  is  an  element  of  a  conver¬ 
gent  neighborhood  of  the  solution;  other  initial  guesses 
run  the  risk  of  missing  that  neighborhood.  (It  is  not 
known  why  this  occurs.) 
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TABLE  IV. I 


SIMULATION  DATA  AND  CONVERGENCE  COMPARISON 


Find  A  =  ,  A1 , , *g)  for 


p  (x)  =exp[-A0-X1Fl-X2F2->.5F5-*8FS]  on 


[a,b]  =  [5104.12,  8262.58]. 


Symbol 


Expected  Value 


.00457 


6492.26 


72050.87 


Function 


(x-x) /s 
(x-x)  2 /s 2 
In  (b-x) 
(x-x)  Vs" 
sample  mean 


var lance 


Test  Comparisons 


Initial  guess:  A  =(374.114,  0.0,  -.348,  35.707,  .0153) 
Result :  Terminal  numerical  error 


Initial  guess:  A  =(0.0,  0.0,  -.348,  35.707,  .0153) 
Result:  Failed  to  converge  in  35  iterations 


Initial  guess:  l\^  (0 .  ,  0.,  0.,  0.,  0.) 

Result:  Convergence  in  20  iterations 

=  (-24.336  ,  .  636,  .  540,  4.125,  .  002) 


7.4647 


3.0168 


Subroutine  ENTROP  is  implemented  for  a  maximum  of 
six  active  information  functions  plus  the  normalizing  func¬ 
tion  gg(x)=l.  The  number  of  active  functions  is  restricted 
for  two  reasons.  First,  the  Newton  method  becomes  compu¬ 
tationally  cumbersome  as  the  number  of  constraints,  i.e., 
the  number  of  active  information  functions,  increases. 

The  primary  numerical  difficulty  centers  on  the  symmetric 
Jacobian  matrix,  J,  which  was  discussed  in  the  proof  of 
Theorem  4.3: 

J  «  [?f^  (A) /»Xi)  j,i=0,l,...k. 

We  demonstrate  the  potential  numerical  difficulty  by 
relating  the  initial  research  which  considered  moments 
about  zero  as  information  functions,  i.e.,  gi(x)=x1.  The 
Jacobian  in  this  case  follows: 

-/ p(x)  dx  -/xp(x)  dx  ...  -fx  p(x)  dx 

K+l 

J  =  -/xp(x)  dx  -/x2p(x)  dx  ...  -/x  p(x)  dx 

~/xkp(x)  dx  -/xk+^p(x)  dx  ...  -/x2kp (x)  dx 

As  k  increases,  or  as  the  interval  of  integration,  [a,b], 
increases,  the  elements  in  the  k^*1  column  (or  row)  will  be 
much  larger  than  elements  in  the  first  column  (or  row) . 
Couple  this  structure  with  numerical  error  and  limited 
machine  precision,  and  we  have  created  a  matrix  which  the 


computer  interprets  to  be  "singular."  Such  ill-conditioning 
is  not  restricted  to  moments.  The  example  of  Table  IV. I 
with  large  initial  value  for  Xq,  Xq=374.114,  produced  the 
same  difficulty.  Hornbeck  (Ref  40)  discusses  ill- 
conditioning  and  suggests  means  of  circumventing  the 
effects  of  an  ill-conditioned  matrix.  Scaling  and  the  use 
of  double  precision  computation  will  delay  the  impact  of  an 
ill-conditioned  matrix.  Ill-conditioning  is  controlled  in 
the  entropy  procedure  by  restricting  the  number  of  active 
information  functions  and  normalizing  functions  when  neces¬ 
sary.  For  example,  functions  g^ (x) = ( (x-U ) /o ) 1  or 
a.  (x)  =  (x-;j)1  are  used  in  place  of  moments  about  zero, 
c;.  ( y.)  —  x 1 ,  wnere  V  is  the  mean  and  o  is  the  standard  devi- 
a  t ion . 

A  second  reason  for  limiting  the  number  of  active 
information  functions  is  the  desire  to  produce  a  meaning¬ 
ful  and  usable  closed  form  for  p(x),  the  approximation 
density.  If  the  number  of  functions  used  in  a  specific 
representation  of  p(x)  is  large,  i.e.,  greater  than  six, 
then  the  practicality  of  the  entropy  method  is  reduced. 

If  six  functions  are  not  enough,  then  we  should  consider 
whethi-r  we  have  included  the  correct  potential  functions. 

Se  ion  of  potential  and  active  information  function 
sets  is  the  subject  of  the  following  four  chapters.  The 
pot  err  ini  set.  defined  in  Chapter  V  has  produced  excellent 
rr-.-u’ts  f or  a  variety  of  sample  distributions  and  has 


always  required  five  or  fewer  active  functions.  An  upper 
bound  of  six  active  functions  is  conceptually  reasonable, 
and  experimentation  confirms  this  limit. 

Although  the  Newton  method  as  implemented  in  ENTROP 
has  succeeded  in  all  tests  on  [a,b]  with  initial  guess 
Aq=(0,0, . . .0)  ,  convergence  is  not  guaranteed.  Other 
numerical  schemes  exist  which  can  be  applied  when  neces¬ 
sary.  An  effective  though  slower  method  for  solving  the 
constraint  equations  is  a  method  which  Acton  (Ref  2)  has 
named  the  "curve  crawler."  This  approach,  for  three  con¬ 
straints,  is  initiated  by  solving  the  first  constraint, 
i.e.,  fg(A)=0  of  equations  4.8,  for  vector  A.  Small  steps 
are  then  taken  along  the  surface  of  fQ(A)=0,  in  the  nega¬ 
tive  gradient  direction,  while  seeking  the  zero  of  the 
second  constraint,  i.e.,  f^(A)=0.  The  method  proceeds  by 
staying  close  to  the  f q ( A ) =0  and  f^(A)=0  curve  and  seeking 
the  A  which  zeroes  f 2 ( A) =0 .  The  method  was  implemented  by 
Orr  (Ref  63)  in  work  on  the  [O,00)  interval  and  is  explained 
in  detail  by  Acton.  Our  success  with  ENTROP  and  the  Newton 
method  made  the  development  of  a  backup  method  unnecessary. 
The  "curve  crawler,"  a  gradient  descent  approach,  is  sug¬ 
gested  as  an  alternative  should  the  Newton  method  fail. 

Chapter  Summary 

In  this  chapter,  we  have  presented  the  theoretical 
development  of  the  entropy  method  to  include  existence  and 
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uniqueness  discussions.  Solution  of  the  constraints  (equa 
tions  4.8)  was  discussed  in  both  theoretical  and  applica¬ 
tions  settings.  An  effective  subroutine  to  solve  the  con¬ 
straints  has  been  developed,  tested,  and  briefly  discussed 
Using  the  information  function  set  of  Chapter  V,  the  sub¬ 
routine  has  been  tested  against  sample  densities  of  the 
following  forms  on  interval  [a,b]:  normal,  beta,  gamma, 
exponential,  uniform,  Weibull,  mixtures  of  the  pre¬ 
ceding  densities,  and  unknown  samples.  The  routine  pro¬ 
duced  exact  results,  where  results  were  known  ahead  of 
time,  and  statistically  accectable  results  for  unknown) 
distributions.  The  routine  converged  for  -ve-y  test  with 
an  initial  guess  of  .1 Q  =  ( 0  ,  .  .  .  0 )  .  '’’his  subroutine  is  the 
key  element  to  machine  implementation  of  the  entropy  char¬ 
acterization  method.  However,  the  accuracy  of  the  charac¬ 
terization  method  in  representing  unknown',  distributions 
centers  on  selection  of  the  correct  information  functions. 
The  next  four  chapters  address  information  function  selec- 
t  ion . 
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Chapter  V.  Potential  Information  Functions 


Information  Functions 

Given  random  variable  X,  we  wish  to  approximate 
the  distribution  of  X  based  on  available  (or  computable) 
information.  The  information  will  be  collected  in  terms 
of  average  or  expected  values  of  certain  functions  of  X; 
we  call  these  functions  "information  functions."  By  speci¬ 
fying  the  expected  values  of  k  information  functions, 
g^(x),  i=l,2,...k,  and  applying  the  maximum  entropy  pro¬ 
cedure  of  previous  chapters,  we  obtain  p(x)  an  approxima¬ 
tion  to  the  unknown  density  of  X,  where  f (x)  is  the  unknown 
density  and 

p(x)  =  exp[-XQ-X1g1(x)-.  .  .  >-kgk(x)  )  (5.1) 

Clearly,  the  number  and  forms  of  information  functions  will 
impact  the  accuracy  of  approximation. 

To  demonstrate  the  importance  of  proper  informa¬ 
tion  function  selection,  we  use  the  moments  about  zero  as 
information  function*-.  Consider  the  beta  distribution  on 
[0,1]  : 

f  (x)  =  C  xP_1  (1-x)  Q_1  (5.2) 

where  C=F (P+Q) /  ( ~  (P) T (Q) ) ,  and  P  and  Q  are  the  beta  param¬ 
eters.  We  first  use  equation  5.1  with  k=l;  that  is,  our 
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total  available  information  consists  of  the  average  values 
of  gQ  (x)  and  g^x)  where  gQ(x)=l  and  g1(x)=x.  As  previously 
discussed,  the  function  gQ(x)  corresponds  to  the  constraint 
that  p(x)  be  a  density,  i.e.,  /p(x)  dx=l.  The  resulting 
entropy  approximation,  given  <g1(x)>,  is  denoted  p1 (x)  where 

р,  (x) =exp(-Xp-A  (x) ]  on  [0,1].  Figure  5.1  displays  f (x) 
fP=4,  Q=2)  and  (x)  to  illustrate  the  error  of  approxima¬ 
tion.  Now  consider  collecting  additional  information  in 
terms  of  <g2  (x)  >=<x‘  to  find  our  second  approximation, 

P-,  (x)  =exp  [  -  q- a^x- ”2X" )  •  Notice  that  .  represents  the 

•  h 

i  Lacrange  multiplier  m  each  entropy  characterization ; 
however,  the  Lagrange  multipliers  in  one  representation  are 
net  related  to  (and  need  not  equal)  the  multipliers  in  subse¬ 
quent  characterizations.  Figure  5.2  demonstrates  that  the 
increased  information,  i.e.,  the  second  moment,  has  improved 

с. ur  approximation.  Additional  information,  in  terms  of 
additional  moments,  continues  to  improve  the  approximation 
(Figures  5.3,  5.4  and  Table  V.I).  It  can  be  showrni  (Ref  90) 
that  p;.  (x)  -*f  (x)  as  k-*-*.  Table  V.I  shows  that  at  k=6  we 
am  approaching  an  acceptable  numeric  approximation. 

However,  the  moment  approach  presents  two  signifi¬ 
cant  oroblens.  First,  as  k  increases  or  as  the  interval 
of  interest,  [a,b],  becomes  large,  solution  of  the  con¬ 
straints  for  the  Lagrange  multipliers  becomes  numerically 
intractable  (see  Chanter  IV).  Some  authors  feel  that 
"most"  well-behaved  distributions  are  "amply"  described  by 
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ENTROPY  OENSI 


S9m«A  A1ISN30 


Fig.  5.2.  Entropy  Approx imaticn  with  Two  Moments 


ENTROPY  DENSITY 


rji.-  . 
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Fiq.  5.3.  Entropy  Approximation  with  Three  Moments 


pproxinat.on  with  Four  Moments 


TABLE 


NOTE:  The  entropy  density  with  information  functions  ln(x),  ln(l-x)  repro¬ 
duces  the  analytic  density  to  machine  accuracy. 


the  first  four  moments  (Ref  49).  Pearson’s  "method  of 
moments"  classifies  distributions  based  only  on  the  first 
four  moments  (Ref  65).  However,  Wragg  and  Dowson  (Ref  90) 
indicate,  and  our  example  with  the  beta  distribution  implies 
that  a  general  characterization  scheme  would  require  a 
larger  number  of  moments.  Given  that  a  usable  value  of  k 
can  be  distinguished  and  that  the  moments  can  be  scaled  to 
allow  numerical  solution  of  the  constraints,  we  face  a 
second  problem.  The  second  problem  is  that  the  algebraic 
form  of  p(x),  with  moments  as  information  functions,  tells 
the  analyst  very  little  about  the  unknown  density,  and  p(x) 
may  be  computationally  difficult  to  handle  even  with  four 
moments.  For  example,  consider  the  beta  distribution  of 
equation  5.2  once  more.  Let  g1(x)=ln(x)  and  g2 (x) =ln (1-x) 
and  we  obtain  p (x) =exp[ -> 0~X^  In  (x)-!^  In  (1-x)].  Applica¬ 
tion  of  our  numerical  scheme  produces  Ag=-2 . 9 95732 3 , 

X^=-3.0,  and  X2=-1.0.  We  now  consider  the  form  of  p(x): 

p  (x)  =  exp  t  —  X q ]  exp  [-A^  In  (x)  ]  exp  (-A^  In  (1-x) ]  ,  or 
-A1  "A2 

p(x)  =  exp [- Aq ]  x  (1-x)  ,  and 

P ( x )  -  20.0  x3'0  (1-x)  . 

Thus  pfx)  exactly  equals  f  (x) ,  and  we  have  only  used  two 
information  functions.  Further,  the  algebraic  form  of  the 
"ntrrpy  density  tells  us  that  we  are  working  with  a  beta 
distribution.  Our  selected  functions  are  clearly  superior 


to  moments  in  this  example.  The  above  examples  illustrate 
the  importance  of  selecting  the  correct  information  func¬ 
tions  and  indicate  the  numerical  and  conceptual  deficiencies 
of  relying  solely  on  moments.  Orthogonal  families  of  func¬ 
tions  (Ref  15)  present  the  same  conceptual  and  numeric  prob¬ 
lems  as  observed  with  moments. 

The  procedure  defined  in  Chapter  III  presents  a 
viable  alternative  for  approximation  of  unknown  densities 
on  a  bounded  interval,  [a,b].  The  information  function 
selection  step  of  the  procedure  includes  two  phases.  In 
the  first  phase,  we  specify  a  large  set  of  potential  infor¬ 
mation  functions;  that  is,  linearly  independent  functions 
that  may  prove  useful  in  representing  distributions  on 
[a,b].  A  potential  set  that  has  proved  extremely  useful 
for  a  variety  of  unknown  densities  is  defined  in  the  next 
section  of  this  chapter.  The  procedure  is  designed  to 
allow  flexibility  in  definition  of  the  potential  set,  and 
this  flexibility  is  also  discussed.  In  the  second  phase, 
we  select  an  "active  set"  of  information  functions  from  the 
potential  set.  A  large  number  of  active  functions  leads  to 
more  accuracy  in  the  approximation.  However,  a  large  num¬ 
ber  of  functions  also  leads  to  numerical  difficulties  and  a 
loss  of  conceptual  significance  in  the  form  of  p(x).  Selec¬ 
tion  of  the  active  set  is  thus  a  compromise;  we  want  the 
active  set  to  be  as  small  as  possible  within  our  accuracy 
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restrictions.  Three  different  methods  for  selection  are 
discussed  in  subsequent  chapters. 

A  Potential  Information 
Function  Set 

Our  initial  approach  to  specifying  a  potential  set 
for  use  in  a  general  approximation  problem  centers  on  an 
investigation  of  named  distributions  (Refs  37;  47;  55;  60). 

We  consider  the  algebraic  form  of  various  well-known  dis¬ 
tributions  and  determine  what  information  functions,  if 
any,  will  produce  an  equivalent  entropy  density,  p(x).  In 
the  example  of  equation  5.2  we  saw  that  information  func¬ 
tions  In (x)  and  ln(l-x)  produced  a  beta  distribution  on 
[0,1];  that  is,  if  we  provide  <ln(x)t>,  <ln(l-x)>,  and  apply 

i 

the  entropy  procedure,  then  the  resulting  entropy  density  x 

will  be  a  beta.  If  we  specify  no  information  functions, 
i.e.,  only  a^(x)=l  on  [a,b],  then  the  resulting  entropy 
density,  p (x) =exp { —  X q )  ,  equates  to  the  uniform  density. 

A  list  of  the  more  well-known  distributions  and  the  result¬ 
ing  information  functions  is  shown  at  Table  V.II.  We 
reason  that  many  continuous  distributions  on  [a,b]  will  be 
closely  approximated  by  the  listed  distributions  or  some 
combination  of  these  distributions.  Using  Table  V.II,  we 
select  the  most  versatile  distributions  and  eliminate 
redundant  functions  to  produce  the  potential  set  shown  in 
Table  7. III.  Notice  that  Table  V.III  includes  the  first 
four  moments  which  we  also  represent  as  normalized  central 
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TABLE 


NOTE:  Some  of  the  listed  densities  will  require  a  normalizing  constant  when  restnc 
bounded  interval,  but  the  information  functions  will  be  unchanged. 


TABLE  V.III 


A  STARTING  POTENTIAL  SET 


Symbol 

Function 

Symbol 

Function 

FI 

x  or  (x-p) /a 

F6 

[In  (x 

-a )  ]  2 

F2 

x2  or  [  (x-y) /o] 2 

F7 

x  3  or 

[  (x-u) /o] 3 

F  3 

In  (x) 

F8 

x“  or 

[  lx-v)/o] 

F  4 

In  (x-a ) 

F9 

In  (x2 

+1) 

F5 

In (b-x) 

moments,  i.e.,  g2  (>:)  =  2  /o‘  where  u  is  the  calculated 

mean  and  a  is  the  standard  deviation.  Normalization  was 
needed  to  provide  numerical  stability  for  a  specific  simula¬ 
tion  application  on  [5104.0,  8262.0].  Normalization  is 
effective  on  large  intervals,  [a,b],  but  may  produce  the 
opposite  result  if  b-a<l .  On  small  intervals  normalized 
moments  involve  small  values  divided  by  small  values  which 
w ill  lead  to  numerical  instability.  Thus,  origin  or  central 
moments  are  more  effective  on  small  intervals. 

The  functions  in  Table  V.III  are  not  intended  as 
the  ultimate  potential  set  but  will  serve  as  an  excellent 
startin'*  point  for  any  characterization.  Functions  can 
and  should  be  added  to  this  set  (or  deleted)  based  on  data 
anaJysis  for  a  particular  problem.  As  an  example,  we  con- 
rid:;  a  distribution  that  was  first  investigated  by  Chanda 
,-.d  K.:lp  (Ref  II).  The  data  consists  of  2000  samples, 


x^,  i=l ,2 , . . . 2000 ,  from  an  unknown  distribution.  We  trans¬ 
late  the  data  from  [-.4894,  .5028]  to  (.0106,  1.003]  to 
preclude  difficulty  with  the  natural  logarithms  in  Table 
V.III.  We  compute  average  values  from  the  sample  as 
explained  in  Chapter  IV.  Using  the  potential  set  of  Table 
V.III,  we  select  the  active  set  with  method  three  of 
Chapter  VIII.  The  resulting  "best"  fit  required  six  active 
functions  and  is  shown  in  Figure  5.5.  The  sample  density 
is  also  shown  and  was  created  by  sorting  the  2000  deviates, 
creating  the  cumulative  at  each  sample  point,  CUM^=i/2000, 
i=l , 2 , . . . 2000 ,  and  numerically  differentiating.  The  initial 
approximation  missed  the  peaked  structure  of  the  sample 
which  indicates  that  we  failed  to  specify  sufficient  infor¬ 
mation.  The  peaked  sample  suggests  the  shape  of  a  double 
exponential  density,  and  we  thus  add  information  functions 
|x-.5]  (the  .5  accounts  for  translation  of  the  data)  to  the 
potential  set.  Application  of  method  three  resulted  in 
four  active  information  functions  with  the  excellent  char¬ 
acterization  in  Figure  5.6.  Thus  the  nature  of  the  data 
suggested  the  addition  of  a  function  to  the  potential  set, 
and  that  function  was  subsequently  selected  as  active. 

This  example  again  illustrates  the  importance  of 
proper  information  function  selection  but  also  highlights 
the  flexibility  of  the  procedure.  The  characterization 
procedure  was  designed  as  a  tool  for  the  analyst.  Conse¬ 
quently,  data  analysis  and  an  analyst's  insight  can  be  used 
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to  enhance  the  procedure,  particularly  in  the  information 
function  selection  phases.  The  potential  set  of  Table 
V.III  has  been  tested  on  a  variety  of  unknown  densities 
and  produced  excellent  results.  If  the  potential  set  con¬ 
tains  a  sufficient  mixture  of  functions,  then  subsequent 
procedural  steps  will  eliminate  unnecessary  functions  and 
choose  the  useful  functions,  i.e.,  the  active  set. 

Ac f :!  ve  Information  Function  Set 

Selection  of  the  active  information  functions  from 
1 '  e  potential  sot  is  the  subject  of  the  next  three  chapters, 
."he  active  set  was  previously  defined  as  that  subset  which 
is  used  in  the  entropy  approximation  for  a  specific  set  of 
data.  The  90a i  of  the  selection  procedure  is  to  pick  the 
minimum  subset  which  conveys  enough  information  to  provide 
.in  acceptable  approximation  to  the  unknown  density,  f  (x)  . 
Selection  of  the  nct.jve  set  depends  on  how  one  defines 
"acceptable  approxinat ion"  and  how  one  measures  "closeness” 
of  aprroximat ion .  Three  different  approaches  to  the  prob¬ 
lem  resulted  in  three  viable  methods,  each  with  different 
qualities.  The  methods  were  tested  by  generating  data  from 
known  distributions  and  evaluating  the  resulting  approxi- 
-.nt  Lon  .  If  the  potential  set  includes  the  correct  infor- 
■  it i  v  functions  for  a  sample  density,  then  the  selection 
vooed  re  should  select  those  functions  and  produce  an 
t  fit.  For  exair.pl  e ,  if  the  data  is  from  a  normal 
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density  then  the  selection  method  should  select  x  and  x2 . 
For  the  beta  distribution  on  [0,1],  we  expect  ln(x)  and 
ln(l-x)  to  be  selected.  If  the  correct  information  func¬ 
tions  are  not  in  the  potential  set,  then  we  wish  to  select 
the  best  subset  to  approximate  f(x).  The  three  selection 
methods  produce  excellent  approximations  as  will  be  demon¬ 
strated.  The  choice  of  method  for  a  particular  approxima¬ 
tion  problem  will  depend  on  the  available  information, 
accuracy  of  the  information,  and,  in  some  cases,  the 
analyst's  preference. 
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Chapter  VI .  Active  Set  Selection — 


Method  One  (Regression) 

Introduction 

The  concepts  of  potential  and  active  information 
function  sets  were  discussed  in  Chapters  III  and  V.  One  of 
three  general  methods  to  select  the  active  set  for  a  spe¬ 
cific  density  approximation  is  presented  in  this  chapter. 
The  approach  is  based  on  linear  regression  and  requires  a 
random  sample  of  the  unknown  distribution,  x^,  i=l,2,...N. 
We  first  present  the  procedural  steps  for  method  one  and 
follow  with  detailed  discussion  of  a  few  of  these  steps. 
Sample  applications  are  then  presented  to  demonstrate 
method  strengths  and  sensitivities.  The  excellent  results 
of  method  one  led  to  alternate  methods  as  presented  in  sub¬ 
sequent  chapters. 

Method  One  Procedure 

Method  one  includes  five  procedural  steps: 

1.  The  first  step  is  generation  of  the  sample 
cumulative  distribution.  The  sample,  x^,  i=l,2,...N,  is 
sorted  and  the  cumulative  distribution  is  approximated  at 
these  N  points;  CUM^=i/N,  i=l,2,...N.  Clearly,  as  N 
increases,  the  approximation  becomes  more  accurate.  The 
cumulative  data  is  grouped,  for  large  N,  to  produce  better 
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results  in  subsequent  steps;  for  example,  with  N=500  we 
use  every  tenth  x^  to  reduce  the  data  from  500  points  to 
51  points  with  CUM ( j-1 ) 10 /N ,  j=l , 2 , . . . 51 . 

2.  Step  two  produces  a  numerical  estimate  for 
f (x) ,  the  unknown  density.  We  numerically  differentiate 
the  sample  cumulative  to  produce  a  numerical  density  at  M 
points,  DCtT  ,  i=l,2,...M,  M£N .  Numerical  differentiation 
is  an  ill-posed  problem  (Refs  36;  53)  and  requires  care  in 
nppl ication .  Our  work  with  differentiation  techniques 
produced  interesting  results  which  are  presented  later  in 
"his  chapter.  The  initial  approach  to  numerical  differen¬ 
tiation  was  central  difference; 


DEN.  -  lCUMi+1-CUM._1]/;x 


i+l'Xi-l1,  1=2'4' • • • (M-l) . 


3.  The  third  step  produces  the  natural  logarithm 
of  the  numerical  density;  ln(DEN^),  i=l,2,...M.  The  pur¬ 
pose  of  this  step  is  to  establish  a  linear  relationship 
between  the  numerical  density  and  the  entropy  density.  We 
;eek  the  minimum  set  of  information  functions,  g^ (x)  , 

1=1, 2,... k,  which  are  essential  for  accurate  approximation 
of  f  (x)  .  Let  the  entropy  density  include  all  the  poten- 
"ill  functi on s ,  i . e . , 


V'Vi(x' 


X  g  ( x ) 
m  ^m 


V-re  n.  is  the  number  of  functions  in  the  potential  set, 
m.  Wo  establish  the  linear  relationship  as  follows: 
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(6.1) 


lnf(x)=  lnp(x)  =  .  (x)-...A  o  (x) 

u  i  i  mm 

We  now  have  a  form  which  allows  linear  regression,  and  we 
have  M  data  points  (ln(DEN^))  to  approximate  In  (f (x) ) . 

4.  The  fourth  step  is  to  apply  linear  regression 
to  identify  several  possible  sets  of  active  information 
functions.  This  regression  step  reduces  the  number  of  com¬ 
binations  of  potential  functions,  i.e.,  subsets  of  the 
potential  set,  that  will  be  considered  in  selection  of  the 
active  set.  We  consider  sets  of  five  or  fewer  functions 
for  reasons  presented  in  previous  chapters. 

5.  The  final  step  is  selection  of  the  active  set 

from  the  sets  defined  in  step  4.  A  measure  of  "goodness  of 

fit"  is  specified  and  the  active  set  is  the  set  whose  cor¬ 

responding  entropy  characterization  provides  the  "best" 

fit  to  the  data. 

The  above  procedure  has  produced  excellent  results 
as  will  be  shown.  Currently,  the  procedure  is  implemented 
in  two  separate  packages  to  allow  maximum  analyst  involve¬ 
ment.  The  first  package  accomplishes  steps  1  thru  4  and 
returns  10  candidate  sets  of  functions.  The  analyst  may 
use  some  or  all  of  the  10  sets,  or  other  combinations  of 
functions,  as  input  to  the  second  program  which  accomplishes 
step  5.  In  our  examples  we  will  use  only  5  of  the  10  candi¬ 
date  sets.  Steps  2,  4,  and  5  are  discussed  in  more  detail 

in  the  following  sections. 
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Numerical  Differentiation 

Method  one  is  a  straightforward  application  of  well 
known  numerical  and  statistical  techniques;  however,  ini¬ 
tial  testing  indicated  a  sensitivity  to  numerical  and 
sampling  errors.  The  ill-posed  nature  of  numerical  differ¬ 
entiation  can  result  in  exaggerated  numerical  error,  thus 
various  differentiation  schemes  were  investigated  to  reduce 
this  risk.  The  investigation  resulted  in  a  previously 
unpublished  scheme,  the  "median"  method,  which  generally 
outperformed  other  schemes  for  our  application.  A  summary 
of  the  investigation  follows. 

Polynomial  and  Spl ine  Approximations.  Inter¬ 
national  Mathematical  and  Statistical  Libraries  (1MSL)  sub¬ 
routines  were  used  to  fit  a  polynomial  to  the  sample  cumula¬ 
tive  and,  subsequently,  to  differentiate  the  polynomial. 
Polynomials  of  up  to  sixth  degree  were  tested  but  produced 
poor  results.  Polynomial  "wiggle"  (Ref  40)  caused  negative 
density  values.  Spline  approximation  and  spline  interpola¬ 
tion  (Ref  5)  were  attempted  with  existing  IMSL  software  in 
an  effort  to  reduce  the  polynomial  "wiggle."  Differentia¬ 
tion  of  the  spline  also  produced  negative  density  values 
at  a  few  points  and  proved  unsatisfactory. 


Sliding  Polynomial .  A  program  was  written  which 
makes  a  least  squares  fit  to  five  data  points  using  a  second 
decree  polynomial .  Beginning  with  the  first  five  data 


points  (x^  and  CUM^,  i=l , 2 , 3 , 4 ,  5) ,  the  program  produces 
a  second  degree  polynomial  fit  to  the  cumulative  data  and 
uses  the  polynomial  coefficients  to  calculate  the  deriva¬ 
tive  at  the  middle  point,  i.e.,  at  x^.  The  program  then 
advances  the  operative  window  by  one  data  point  and  repeats 
the  procedure  for  i=2,3,4,5,6  to  find  the  derivative  at 
x^.  The  procedure  continues  in  this  fashion  to  produce 
DEt'K,  i=3 , 4  ,  .  .  .  (N-2 )  .  Forward  and  backward  difference 
formulas  are  used  for  the  first  two  and  last  two  data 
points.  A  second  program  was  written  to  accomplish  the 
identical  procedure  but  using  seven  data  points  instead  of 
five.  The  intent  of  using  seven  points  is  to  provide  more 
of  a  smoothing  effect  on  the  data.  The  seven-point  formula 
did  produce  a  more  accurate  derivative  than  the  five-point 
formula,  and  both  schemes  generally  outperformed  the  central 
difference  approach. 

Sliding  Median .  The  median  method  is  based  upon  a 
nonparametric  regression  parameter  estimator  which  was 
first  suggested  by  Theil  (Ref  79) .  The  distribution  of 
this  estimator  was  investigated  by  Sen  (Ref  69)  and  Chanda 
and  Kulp  (Ref  11).  To  our  knowledge,  this  scheme  has  not 
been  previously  used  for  numerical  differentiation.  As  in 
the  polynomial  approach,  we  define  an  operating  window 
about  the  first  seven  data  sets,  x^  and  C(JM^,  i=l,2,...7. 

We  then  consider  all  combinations  of  these  seven  distinct 


data  sets,  taking  two  sets  at  a  time,  and  calculate  the 
value  [CUM^-CUMj] / [xk~Xj]  for  each  combination.  This  value 
equates  to  the  central  difference  at  the  point  midway 
between  x ^  and  x^.  The  values  for  the  21  combinations  are 
then  sorted,  and  DEN ^ ,  the  density  at  point  x^,  is  assigned 
the  median  value.  The  operating  window  is  advanced  one 
data  point,  and  the  procedure  repeats  to  find  DIN^.  We 
iterate  for  i=4 , 5 , . . .  (N-3) .  Forward  difference  is  used  to 
find  the  density  at  x^ ,  central  difference  for  points  x^, 

X3 '  XN-2 '  *N-1'  and  backward  difference  for  point  x^ .  A 
similar  program  was  created  for  an  operating  window  of  only 
5  data  points.  As  in  the  polynomial  case,  the  7  point 
formula  performed  better  than  *-ha  5  point  formula.  The 
simplicity  of  the  median  method  resulted  in  faster  computa¬ 
tion  than  the  polynomial  approach. 

The  listed  methods  were  tested  against  sample 

cumulatives  from  known  distributions  and  known  densities, 

M 

i (x) .  The  sum  of  errors  squared,  SE  =  I  (f (x • ) -DEN . ) 2 , 

i=l  1  1 

and  mean  squared  error,  SE/M,  were  calculated  for  compari¬ 
son.  Three  example  distributions  are  provided  in  Table 
VI. I .  Each  sample  set  in  Table  VI . I  is  composed  of  500 
deviates  from  the  stated  distribution.  The  sample  dis¬ 
tributions  are  further  described  in  Table  VI. II.  The  500 
deviates  were  grouped  to  M  data  points  before  differen¬ 
tiating.  The  first  three  comparisons  demonstrate  the 
effect  of  qrouping  data  from  a  given  sample  set.  The  last 


TABLE  VI. I 


NUMERICAL  DIFFERENTIATION  SCHEMES 


Sample 
Set  No. 

Distribution 

M 

Scheme 

Time 

(sec) 

(SE)2 

(SE)Z/M 

1 

n  (10,2) 

500 

Central  Dif. 

230.13 

.460 

1 

n(10,2) 

500 

Poly.  5 

.506 

49.9 

.0998 

1 

n(10,2) 

500 

Poly.  7 

.572 

12.02 

1 

n (10,2) 

500 

Median  5 

.211 

20.51 

1 

n(10,2) 

500 

Median  7 

.528 

8.80 

.0176 

1 

n  (10,2) 

101 

Central  Dif. 

.637 

1 

n(10,2) 

101 

Poly.  5 

.098 

.322 

.00319 

1 

n (10,2) 

101 

Poly.  7 

.113 

.180 

.00178 

1 

n(10,2) 

101 

Median  5 

.041 

.227 

.00225 

1 

n(10,2) 

101 

Median  7 

.096 

.164 

1 

n (10,2) 

51 

Central  Dif. 

.1210 

1 

n(10,2) 

51 

Poly.  5 

.046 

.0664 

1 

n(10,2) 

51 

Poly.  7 

.0472 

.000925 

1 

n (10,2) 

51 

Median  5 

.022 

.0560 

.001098 

1 

n(lC,2) 

51 

Median  7 

.019 

.0410 

.000804 

2 

n(10,2) 

51 

Central  Dif. 

.1021 

2 

n(10,2) 

51 

Poly.  5 

.048 

.0457 

.000895 

2 

n (10,2) 

51 

Poly.  7 

.055 

.0255 

.000500 

2 

n  (10,2) 

51 

Median  5 

.000776 

2 

n (10,2) 

51 

Median  7 

.039 

.000481 

3 

beta  P=4,  0=2 

51 

Centred  Dif. 

6.767 

.1327 

3 

beta  P=4,  0=2 

51 

Poly .  5 

.049 

3.579 

.0702 

3 

beta  P=4,  0=2 

51 

Poly.  7 

.056 

2.127 

3 

beta  P=4,  0=2 

51 

Median  5 

.019 

3.581 

.0702 

3 

beta  P=4,  0=2 

51 

Median  7 

2.375 

.0466 

NOTE:  Each  sample  set  includes  500  data  points  which 
were  grouped  to  M  points  before  differentiation. 

M 

( SE ) 2  =  Z  (Actual-Approx) 2 . 
i=l 
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two  comparisons  are  representative  examples  from  other 
samples.  The  point  of  significance  is  that  the  median 
method  generally  produced  closer  approximations  to  the 
known  densities  than  either  the  polynomial  or  central  dif¬ 
ference  methods.  The  median  approach  limits  extreme  values 
caused  by  numerical  and  sampling  error,  thus  producing  a 
closer  fit  to  the  true  analytic  density.  The  method  does 
not  eliminate  differentiation  "noise"  but  does  control  the 
magnitude  of  this  noise.  Figure  6.1  and  Figure  6.2  provide 
examples  of  the  densities  produced  by  the  median  method 
for  a  normal  distribution  (mean  10  and  variance  2)  and  a 
beta  distribution  (on  [0,1]  with  P=4,  Q=2) .  The  sliding 
median  method  with  7  data  points  was  used  for  all  subsequer 
numerical  differentiation  in  the  research. 

Regression 

We  use  linear  regression  in  step  4  to  reduce  the 
number  of  candidate  active  sets.  Linear  regression  is  a 
well  known  and  well  defined  analytical  tool.  Drapper  and 
Smith  (Ref  23)  and  others  (Refs  28;  37;  38)  provide 
detailed  explanation  of  regression  procedures,  regression 
statistics,  and  stopping  criteria.  Both  "stepwise  regres¬ 
sion"  and  "regression  by  leaps  and  bounds”  were  researched 
to  include  available  software  for  implementation;  Statis¬ 
tical  Package  for  the  Social  Sciences  (SPSS)  and  IMSL  sub¬ 
routine  libraries  contain  regression  packages.  The  IMSL 
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Numerical  Dif ferentiation--Data  Set 


leaps  and  bounds  package,  RLEAP ,  was  selected  for  our 
application  with  the  adjusted  R2  statistic  (Ref  38) .  We 
created  a  program  which  uses  RLEAP  to  return  ten  candidate 
active  sets  with  corresponding  adjusted  R2  statistics; 
i.e.,  the  two  best  sets  with  one  active  variable,  through 
the  two  best  sets  with  five  active  variables.  "Best  set" 
is  defined  as  the  set  with  the  largest  adjusted  R2  sta¬ 
tistic  . 

An  examination  of  our  regression  procedure  will 
identify  a  significant  benefit  of  our  entropy  approxima¬ 
tion  method.  We  use  regression  to  identify  ten  possible 
active  sets.  However,  we  could  use  the  same  regression 
packages  to  select  the  single  active  set  that  produces  thv 
best  regression  fit  to  the  data  by  choosing  the  one  set 
with  the  largest  adjusted  R* .  Regression  will  also  pro¬ 
duce  the  regression  coefficients,  A^,  i=0,l,...k,  for  our 
selected  set  to  completely  specify  p(x)  as  in  equation 
(6.1) .  Thus,  why  not  stop  at  the  regression  step?  The 
reason  for  step  5  cente-s  on  the  purpose  of  linear  regres¬ 
sion.  Regression  seeks  a  fit  to  the  sample  data  and  the 
p(x)  produced  in  regression  is  not  required  to  satisfy 
averane  value  or  density  constraints  (see  Chapter  IV). 

Thus,  the  regression  p(x)  need  not  be  a  density  function 
at  all  and  will  be  a  maximum  entropy  density  only  by  coinci¬ 
dence.  The  sole  intent  of  the  regression  procedure,  as  we 
have  defined  it,  is  to  identify  information  functions  that 
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play  a  key  role  in  describing  the  sample  cumulative.  Once 
regression  has  defined  several  candidate  sets  of  functions, 
we  return  to  the  entropy  approach  to  select  the  density 
(i.e.,  select  A ^ ,  i=0,l,...k)  which  satisfies  average 
value  constraints.  This  entropy  approach  is  thus  a  com¬ 
promise  between  a  fit  to  the  sample  data  and  constraint 
satisfaction.  One  may  view  constraint  satisfaction  as  an 
attribute  of  the  underlying  analytic  distribution  versus  a 
function  of  the  sample  distribution.  Consequently,  our 
approach  provides  a  compromise  between  the  unknown  analytic 
distribution  and  the  provided  sample.  The  examples  of  the 
next  section  will  demonstrate  this  quality. 

Experimental  Results 

The  strengths  and  sensitivities  of  method  one  are 
best  demonstrated  with  examples.  We  consider  the  first 
four  steps  of  method  one  in  this  section  and  subsequently 
discuss  stem  five.  Our  goal  is  to  select  the  minimum  set 
of  functions  (active  set)  which  produces  an  acceptable, 
closed  form,  entropy  approximation,  p(x),  to  the  unknown 
density,  f  (x)  .  To  test  method  one,  we  generate  random 
samples  of  size  500  from  known  distributions,  i.e.,  normal, 
beta,  and  gamma.  Thus  the  analytic  cumulative  and  density 
functions  are  available  for  comparison  to  sample  and  entropy 
distributions.  We  will  consider  the  three  sample  data  sets 
of  Table  VI. II  for  purposes  of  illustration.  The  normal 
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samples  were  produced  by  the  Box-Muller  method  (Ref  8) 
applied  to  uniform  deviates.  The  beta  deviates  were  pro¬ 
duced  via  an  IMSL  subroutine. 

We  first  discuss  the  sensitivity  of  method  one  to 
sampling  and  numerical  differentiation  errors.  Experi¬ 
mentation  with  normal,  beta,  and  gamma  distributions  has 
shown  that  if  the  actual  analytic  density  is  used  in  place 
of  the  numerical  differentiation  step  (i.e.,  sampling  and 
differentiation  variations  are  not  permitted) ,  then  the 
regression  step  will  select  the  "correct"  information  func¬ 
tions  with  an  exact  fit  to  the  data  (i.e.,  adjusted  R? 
equal  to  100) .  The  "correct”  functions  for  a  distribution 
are  the  information  functions  presented  in  Table  V. II,  i.e., 
x  and  x2  for  normal,  ln(x)  and  ln(l-x)  for  beta,  etc. 
However,  when  the  complete  procedure  is  applied,  i.e.,  a 
sample  cumulative  is  generated  and  differentiated,  the 
regression  step  does  not  necessarily  select  the  expected 
inf ormation  functions.  In  fact,  the  randomness  of  sample 
data  may  cause  selection  of  different  function  sets  when 
multiple  regression  is  applied  to  different  samples  from  the 
same  distribution.  The  interesting  point  is  that  which¬ 
ever  set  of  functions  is  selected,  excellent  approximations 
are  obtained.  Thus,  method  one  demonstrates  data  depen¬ 
dence  . 

Samples  one  and  two  of  Table  VI. II  for  the  normal 
distribution  provide  an  example  of  data  dependence.  We 
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use  the  potential  information  functions  of  Table  VI. Ill, 
where  u,  °2,  and  [a,b]  are  10,  2,[x^,xNl  respectively, 
and  apply  method  one.  Table  VI. IV  lists  five  candidate 
active  sets  that  result  for  each  sample,  El,  1=1, 2,... 5, 
where  I  represents  the  number  of  active  functions.  Notice 
we  have  included  only  five  candidate  sets;  the  analyst  may 
choose  to  consider  a  larger  number  of  sets  for  other  appli¬ 
cations.  For  our  potential  set,  F2=  (  (x-M)/c)  2  is  the  infor¬ 
mation  function  that  will  exactly  characterize  the  normal 
distribution,  and  F2  is  dominant  for  both  data  sets.  If 
we  next  solve  the  constraint  equations,  given  accurate 
expected  values  for  the  information  functions,  we  will  pro¬ 
duce  a"  exact  fit  to  the  analytic  distribution  for  any  set 
which  contains  F2.  Figure  6,3  provides  a  comparison  of 
sample  and  entropy  cumulatives  to  the  known  analytic  cumula¬ 
tive  for  active  set  El.  Entropy  and  analytic  cumulative 
values  were  computed  by  numerical  integration  of  respective 
densities.  Differences  in  cumulatives,  i.e.,  sample- 
analytic  and  entropy-analytic,  are  shown  to  facilitate 
comparison  and  because  the  distributions  are  very  close. 

The  entropy-analytic  curve  is  identically  zero  because  the 
entropy  approxima tion  provides  a  near  perfect  fit  to  the 
analytic  distribution.  Graphs  were  also  produced  for  sets 
E2,  E3,  E4  from  data  set  one  and  El  through  E5  for  data 
set  two.  The  graphs  were  nearly  identical  to  Figure  6.3 
because  F2  (the  correct  information  function)  was  part  of 
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TABLE  VI. Ill 


POTENTIAL  INFORMATION  FUNCTION  SET 


FI  =  (x-y) /o 

F  4  = 

In (x-a) 

*1 

II 

X 

•c 

Q 

U» 

F2  =  (  (x-y) /a) 2 

F5  = 

In (b-x) 

F8  =  ( (x-y) /o) " 

F3  =  In (x ) 

F6  = 

(In  (x-a) ) 2 

F9  =  In (x2  +1) 

NOTE:  y 

=  mean;  c 

-  standard 

deviation;  [a,b]  = 

bounds . 


TABLE  VI. IV 

REGRESSION  RESULTS  FOR  NORMAL  SAMPLES 


Candidate 

Set 

Functions  for 
Data  Set  #1 

Adjusted 

R2 

Functions  for 
Data  Set  #2 

Adjusted 

R2 

El 

F2 

91.76 

F2 

89.70 

E2 

F2,F6 

95.75 

F2,F8 

95.70 

E3 

F2,F6,F8 

96.97 

F2,F6,F8 

96.19 

E4 

F2,F4,F5,F6 

97.07 

F2,F3,F4,F5 

97.37 

E5 

Fl,F5,F6,F7,F9 

97.12 

F2 ,F4 ,F5,F8,F9 

98.66 

the  entropy  representation,  and  additional  functions  pro¬ 
vided  little  usable  information.  Set  E5  of  data  set  one 
does  not  include  F2  and  yet  provides  an  excellent  regres¬ 
sion  fit  to  the  data,  i.e.,  R2=97.12.  The  resulting 
entropy  approximation  also  produces  an  excellent  fit  to 
the  analytic  and  thus  the  sample  as  shown  in  Figures  6.4 
and  6.5.  Table  VI. V  provides  further  insight  with  a 
numerical  interpretation  of  the  entropy  approximations  pro¬ 
duced  by  El  through  E5  for  data  set  one. 

The  beta  distribution  on  [0,1]  provides  a  more 
revealing  example.  Because  we  are  on  the  [0,1]  interval, 
the  normalized  moments  in  our  potential  set  (Table  VI. Ill) 
are  replaced  with  moments  about  zero.  Table  VI. VI  repre¬ 
sents  the  results  of  applying  method  one  to  data  set  three. 
The  desired  functions  for  a  perfect  fit  to  the  analytic 
distribution  are  F3  and  F5.  Notice  from  Table  VI. VI  that 
only  E4  includes  F3  and  F5.  Figures  6.6  through  6.14  pre¬ 
sent  a  comparison  of  analytic,  sample,  and  entropy  densi¬ 
ties  and  cumulatives.  Notice  again  that  for  set  E4  method 
one  provides  an  exact  fit  to  the  analytic  distribution 
(Figure  6.9  and  6.13).  Sets  E5  and  E3  perform  quite  well 
even  without  the  desired  information  functions.  Table 
Vi. VII  provides  a  numerical  evaluation  of  the  entropy 
approximations . 
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Fig.  6.9.  Density  Comparison--E4 /Data  Set 
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Cumulative  Differences — E2/Data  Set 


ACTIVE  SET  E3 

SAMPLE-ANALYTIC 

ENTROPY-ANALYTIC 


ACTIVE  SET  E4 

SAMPLE-ANALYTIC 

ENTROPY-ANALYTIC 


Cumulative  Differences — E4/Data  Set 


RCTIVE  SET  E5 

SAttPLE-RNBLYTIC 

ENTROPY-ANALYTIC 


VI .VII 


A  review  of  Tables  VI. V  and  VI. VII  and  previous 
figures  will  consolidate  four  significant  points  concerning 
method  one: 

1.  The  method  is  sensitive  to  sample  data  and  may 
suggest  candidate  active  sets  that  do  not  include  the  ana¬ 
lytically  correct  functions.  The  reason  is  that  the  method 
provides  a  compromise  between  sample  and  analytic  distribu¬ 
tions  and  may  require  "other"  functions  to  accomplish  that 
compromise . 

2.  An  acceptable  approximation  to  the  unknown  dis¬ 
tribution  may  be  obtained  even  if  the  "correct"  functions 
are  not  part  of  the  active  set.  Set  E5  for  the  beta 
example  and  set  E5  for  the  normal  demonstrate  this  quality. 

3.  Given  accurate  expected  values,  inclusion  of 
the  analytically  correct  functions  in  the  active  set  will 
produce  an  exact  fit. 

4.  Finally,  the  information  functions  from  one 
sample  will  provide  excellent  approximations  for  subse¬ 
quent  samples.  The  two  normal  data  sets  exemplify  this 
quality.  Since  our  technique,  in  general,  approximates 
the  unknown  analytic  distribution,  and  the  sample  is  an 
approximation  to  the  analytic  distribution,  then  one  would 
expect  an  excellent  tit  to  subsequent  samples. 

Before  proceeding  to  final  selection  of  the  active 
.'-•■’t,  we  demonstrate  the  sensitivity  of  method  one  to  errors 
j:  the  calculation  of  expected  values  of  information 
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functions.  This  subject  is  further  pursued  in  Chapter  X. 
These  expected  values  determine  the  constraint  equations 
and  thus  determine  the  final  form  of  our  approximation, 
p(x).  Expected  values  may  also  be  involved  as  parameters 
in  the  potential  set  such  as  y  and  o2  in  Table  VI. III. 

The  above  examples  used  accurate  expected  values  which  were 
approximated  by  a  32  point  quadrature  formula.  We  might 
also  approximate  these  values  with  averages  from  the  random 
sample : 

500 

<g, (x) >  5  I  g  .  (x.) / 500 . 

3  i=l  3  1 

Table  VI. VIII  lists  the  average  and  quadrature  values  for 
our  three  sample  data  sets. 

As  one  would  expect,  use  of  averages  in  lieu  of 
the  more  accurate  quadrature  values  will  produce  a  subse¬ 
quent  change  in  the  entropy  approximation.  We  demonstrate 
an  interesting  result  by  using  the  average  values  for  the 
normal  sample,  data  set  one,  to  include  mean  and  variance 
values  in  the  information  functions.  The  entropy  procedure 
produced  a  less  accurate  fit  to  the  analytic,  as  expected, 
but  a  more  accurate  approximation  to  the  sample  data.  This 
is  demonstrated  in  Figures  6.15  and  6.16  which  graph  sample, 
analytic,  and  entropy  comparisons.  Notice  that  the  "entropy 
minus  analytic"  curve  follows  the  trend  of  the  "sample  minus 
analytic"  -,’alues.  Comparison  to  Figures  6.3  and  6.4 
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RCTIVE  SET  ES 

SRHPLE-RNRLYTIC 

ENTROPY-RNRLYTIC 


(quadrature  values)  exemplifies  this  stronger  correspon¬ 
dence  to  sample  and  weaker  relationship  to  analytic. 

Similar  results  were  obtained  with  beta,  gamma,  and  simula¬ 
tion  sample  densities. 

A  word  of  caution  is  needed  in  reference  to  the 
figures.  The  figures  were  designed  to  accentuate  the  dif¬ 
ference  in  distributions  and  not  the  closeness  of  approxi¬ 
mation.  Careful  attention  to  the  actual  values  of  differ¬ 
ences  between  distributions  or  review  of  Tables  VI. V  and 
VII  will  indicate  that  the  entropy  approximations  are  quite 
close  to  the  sample  and  analytic  distributions.  In  fact, 
the  candidate  sets  provide  such  excellent  approximations 
that  the  choice  of  the  single  best  active  set  is  difficult. 
Goodness  of  fit  statistics  are  used  in  this  final  step  of 
the  procedure. 

Goodness  of  Fit 

Experimental  results,  in  addition  to  the  above 
examples,  indicate  that  the  regression  procedure  will  pro¬ 
duce  adequate  fits  to  the  sample  data  and  accurate,  if  not 
exact,  fits  to  the  underlying  analytic  distribution  even 
when  the  selected  information  functions  are  not  those 
expected.  The  question  remains  as  to  how  one  selects  the 
best  set  of  information  functions  from  the  several  candi¬ 
date  sets.  Since  an  accurate  fit  to  the  sample  cumulative 
is  desired,  the  active  set  will  be  the  candidate  set  that 
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produces  the  smallest  error  between  sample  and  entropy 
cumulatives.  Again,  the  sample  cumulative  is  available  at 
N  sample  points,  and  the  entropy  cumulative  may  be  calcu¬ 
lated  by  integrating  the  entropy  density. 

Selection  of  a  measure  of  error  between  sample  and 
entropy  cumulatives  will  impact  selection  of  the  active 
set.  If  we  are  only  concerned  with  absolute  error  between 
the  distributions,  then  we  might  use  the  errors  squared 
measure  of  previous  examples  (Tables  VI. V  and  VII).  However, 
we  would  like  to  know  more.  Besides  producing  the  "best" 
fit,  we  would  like  to  know  "how  good"  that  fit  is  in  a  sta¬ 
tistical  sense.  A  goodness  of  fit  statistic  will  provide 
tf  •'  s  information.  Step  five  of  method  one  may  now  be 
stated  explicitly: 

5.  Identify  a  goodness  of  fit  statistic,  SK,  which 
is  a  function  of  sample  and  entropy  distributions;  calcu¬ 
late  SK  for  each  candidate  set;  select  the  set  with  minimum 
SK  as  the  active  set;  and  finally,  specify  the  level  of  sig¬ 
nificance  for  the  selected  SK.  We  thus  select  the  informa¬ 
tion  function  set  that  provides  the  best  fit  in  the  sense 
of  our  chosen  statistic. 

Several  goodness  of  fit  statistics  are  discussed 
m  Appendix  B.  Each  statistic  has  strengths  and  weaknesses 
as  indicated  in  the  appendix  and  references.  Different  sta- 
'  '*■  ■  's  may  result  in  different  active  sees.  Two  examples 

arc  presented  ir.  Table  VI .  IX  for  Anderson-Darling,  A2, 
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Kolmogorov-Smirnov,  D,  and  Cramdr  von  Mises,  W2 ,  statistics 
and  the  mean  error  squared,  M2  .  Consider  the  normal  example 
of  Table  VI. IX.  Since  El  through  E4  produce  identical 
results,  our  choice  of  active  set  is  a  decision  between  El 
and  E5.  Regardless  of  the  analyst's  choice  of  statistic, 
the  statistical  values  for  the  two  sets  are  very  close. 

The  analyst  would  probably  select  the  smaller  set,  El,  in 
this  case.  Notice  that  use  of  the  A2  or  W2  statistics 
result  in  acceptance  of  the  hypothesis  of  equal  distribu¬ 
tions  at  a  critical  value  of  a>.15.  The  D  statistic  results 
i  n  .t  ’  .  1 5  . 

The  beta  example  of  Table  VI. IX  better  demonstrates 
the  flexibility  of  method  one  and  the  importance  of  the 
choice  of  statistic.  We  see  that  E4  produces  the  smallest 
value  of  M2 ,  D  ,  and  K2 .  This  result  is  pleasing  in  that 
E4  contains  the  expected  information  functions  F3=ln(x) 
and  F5=in(l-x).  However,  a  concern  for  a  fit  to  the  tails 
of  the  unknown  distribution  may  force  the  analyst  to  use 
the  A2  statistic  (see  Appendix  B) .  Use  of  A2  will  result 
in  selection  of  set  E5.  Notice  that  both  E4  and  E5  provide 
excellent  approximations  for  all  the  listed  measures. 


Method  one  uses  linear  regression  and  "goodness  of 
fit”  principles  to  select  the  best  active  set  for  all  pos- 
•3ii  le  .-ombina  tio.ns  of  the  potential  information  functions. 
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The  method  involves  five  steps: 

1.  Generate  the  sample  cumulative; 

2.  Obtain  the  sample  density,  f  (x) , 
differentiation; 


m 


via  numerical 


3.  Equate  ln(f(x))=  l  X.g. (x)  where  m  is  the  num- 

i=0 

ber  of  functions  in  the  potential  set; 

4.  Regress  on  the  equation  of  step  three  to  pro¬ 
duce  several  candidate  active  sets; 

5.  Use  statistical  measures  to  select  the  best  set. 


Method  one  has  been  tested  against  various  distributions 
in  the  normal,  beta,  and  gamma  families  and  against  the 
simulation  models  of  Chapter  IX.  An  acceptable  approxima¬ 
tion  resulted  in  every  case  u^mg  the  potential  set  of 
Table  VI. III.  The  method  is  based  on  proven  analytic  and 
statistical  techniques,  provides  ample  opportunity  for 
analyst  input  and  modification,  and  produces  a  compromise 
between  the  sample  and  the  unknown  analytic  distributions. 
The  excellent  results  of  method  one  prompted  further  inves¬ 
tigation  of  active  set  selection  procedures.  The  next  two 
chapters  discuss  alternate  methods. 


Chapter  VII .  Active  Set  Selection — 


Method  Two  (Divergence) 


Introduction 

The  linear  regression  approach  of  method  one 
(Chapter  VI)  produces  excellent  results  in  selecting  the 
"active"  set  of  information  functions,  for  a  particular 
approximation,  from  the  predefined  "potential"  set.  Method 
one  demonstrates  strong  sample  dependence,  sensitivity  to 
numerical  errors,  and  sensitivity  to  choice  of  goodness  of 
fit  statistic.  The  selected  active  set  produces  a  distribu 
tion  which  adequately  fits  the  sample  and  underlying  ana¬ 
lytic  distributions  but  is  generally  a  compromise  between 
the  two.  Moreover,  the  active  set  need  not  include  the 
desired  analytic  information  functions.  While  a  method 
which  provides  such  an  accurate  approximation  is  certainly 
a  useful  tool,  if  the  method  can  also  identify  the  correct 
functions  for  the  underlying  analytic  distribution  then  use 
fulness  is  increased.  Additionally,  our  entropy  approxima¬ 
tion  procedure  is  based  on  information  theoretic  concepts. 

A  desire  to  improve  method  one  while  adhering  to  our  infor¬ 
mation  theoretic  theme  led  to  the  information  divergence 
measure  and  the  development  of  method  two.  We  discuss 
divergence  and  present  method  two  with  results. 


Ill 


Experimentation  with  the  regression  method  shows 


that  different  goodness  of  fit  measures  may  result  in  dif¬ 
ferent  active  sets.  The  goodness  of  fit  measures  test  our 
fit  to  the  sample  data,  i.e.,  to  the  sample  cumulative  or 
density.  Although  our  goal  is  c.n  accurate  approximation 
to  the  sample,  we  now  shift  our  concern  to  accurately 
approximating  the  "information  content"  of  the  data.  Thus, 
we  are  changing  our  measure  of  fit  to  the  data. 

The  Kui Iback-Leibler  measure  of  information  varia¬ 
tion  (Refs  30;  51)  measures  the  information  exchange  when 
one  probability  measure  is  replaced  by  a  second  probability 
measure.  As  defined  in  Chapters  II  and  IV,  the  information 
variation,  i.e.,  the  loss  or  gain  of  information,  which 
occurs  when  density  f(x>  is  replaced  by  density  p(x)  is 
do med 

I(p(v),f(x))  =  /P(x)  In  [p(x)/f(x)]dx 

Pullback,  Guiasu,  Jeffreys  (Refs  51;  33;  46)  and  others 
discuss  information  variation  and  its  properties. 

Information  variation  seems  like  the  perfect  con¬ 
ventual  measure.  We  would  like  to  minimize  the  information 
less  wnen  the  sample  density,  f(x),  is  replaced  by  the 
entropy  density,  p(x);  thus  we  select  the  entropy  density 
■.'.•hi'”  minimizes  I  (p  (x)  ,  f  (x)  )  .  Unfortunately,  information 
^iri.  iticn  is  not  commutative,  i.e.,  I  (p  (x)  ,  f  (x) )  (f  (x)  ,p  (x)  ), 

< 


and  I(p(x),f(x))  may  be  negative.  We  know  from  Guiasu  or 
Kullback  (Ref  Chapter  IV)  that 

p  (x)  dx*] 

I  (p(x)  ,f  (x)  )  >  fE  p(x)  dx  In  (X)  dx  1  (7.1) 

where  E  is  the  interval  of  comparison.  If  E  is  a  subset 
of  the  interval  of  definition  for  p(x)  then  /£p(x)  dx  may 
be  less  than  one.  Thus,  the  right-hand  side  of  equation 
(7.1)  may  be  negative,  and  I(p(x),f(x))  may  be  negative. 

Our  interval  of  comparison  will  be  dictated  by  the  random 
sample,  i.e.,  E=[x^,xN],  and  may  force  negative  values  for 
I (p (x ) , f ( x) ) .  Kullback  and  Jeffreys  extend  I(p(x),f(x)) 
to  a  more  usabl .  measure,  divergence,  which  retains  the 
conceptual  strength  of  information  variation. 

Kullback  defines  divergence,  J (p (x) , f (x) ) ,  as  a 
measure  of  the  difficulty  of  discriminating  between  two 
densities  where 

J(p(x)  ,  f  ( x ) )  =  I(p(x),f(x))  +  I(f(x)  ,p  (x)  ) 

=  /  [p  (x) -f  (x)  ]  In  [p(x)/f(x)]  dx 
=  / [ f (x)-p(x)]  In  [f  (x) /p(x) ]  dx 
=  J  (f  (x)  ,p  (x) ) 

As  Kullback  points  out,  J  (p  (x)  ,  f  (x)  )_>0  with  equality  if  and 
only  if  p(x)=f(x)  almost  everywhere  (a.e.).  A  simple 
application  of  equation  (7.1)  will  show  that  J (p (x) , f (x) ) >0 
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regardless  of  our  interval  of  comparison.  Thus,  divergence 
offers  a  viable  means  of  comparing  information  loss  when 
using  an  entropy  approximation  in  place  of  a  sample.  The 
sample  density  is  available  at  M  points,  x^,  i=l,2,...M, 
and  we  can  approximate  f(x)  at  point  y  by  using  linear 
interpolation  between  f (x j)  and  f(x^+^)  where  x^^y£Xj+^. 

The  entropy  approximation  is  available  in  algebraic  form 
and,  thus,  J(p(x),f(x))  can  be  calculated  via  numerical 
quadrature.  We  seek  the  minimum  set  of  information  func¬ 
tions  which  defines  the  density  p(x)  with  minimum  diver¬ 
gence  from  f  (x) .  This  set  will  be  the  active  set  for  the 
given  data. 

Selection  Procedure 

The  active  set  selection  procedure  of  method  two 
is  analogous  to  multiple  nonlinear  regression  but  uses  the 
divergence  measure.  The  procedure  includes  two  phases,  a 
function  addition  phase  (analogous  to  forward  regression) 
and  a  function  deletion  phase  (akin  to  backward  regression) 
The  procedural  steps  follow: 

1.  Generate  the  sample  cumulative  distribution  at 
K  points  as  in  method  one. 

2.  Produce  the  sample  density,  f(x),  at  the  M 
points  by  numerical  differentiation  as  in  method  one. 

3.  Produce  expected  values  of  all  information  f unc 
ticns  the  potential  set  via  guadratui e  or  average  values 
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as  in  method  one.  Steps  1,  2  and  3  concern  data  prepara¬ 
tion  while  steps  4  through  6  describe  the  iterative  pro¬ 
cedure  . 

4.  Use  expected  values  in  the  appropriate  con¬ 
straint  equations  to  find  the  Lagrange  multipliers,  i.e., 
the  A  vector,  for  the  m  entropy  densities 

Pj (x)  =  exp  I-Xq-X j  g ^ (x) ] ,  j=l,2,...m, 

where  m  is  the  number  of  functions  in  the  potential  set. 
Each  pj (x)  thus  contains  one  information  function. 

5.  Find  the  value  of  j  such  that  J  (p  j (x)  ,  f (x) )  is 
a  minimum  and  let  this  value  equal  h.  Function  g^(x)  *-s 
now  considered  a  member  of  the  active  set  which  defines  the 
final  entropy  approximation,  p{x).  If  J(p(x) ,f (x) )<EPS 
where  EPS  is  a  predefined  stopping  criteria,  then  the 
"function  addition"  phase  of  method  two  is  complete.  If 
J(p(x) ,f (x) ) >EPS  then  function  gh(x)  is  retained  in  the 
active  set,  and  we  iterate  to  find  the  next  best  function 
to  add  to  the  active  set.  Steps  4  and  5  are  reaccomplished 
for  the  m-1  entropy  densities  (x)  where 

Pj  (x)  =  exp  [-XQ-Xh  gh(x)-  gj  (x)  ]  ,  j=l,2,...m, 

j?<h.  This  procedure  continues  until  a  maximum  of  K  func¬ 
tions  (we  use  K=6)  are  active,  until  J (p (x) ,f (x) )^EPS,  or 
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until  the  addition  of  a  function  increases  J (p (x) , f (x) ) . 

We  thus  select  the  active  set  with  minimum  J (p (x)  ,f (x) )  . 

6.  The  final  procedural  step,  the  "function 
deletion"  phase,  checks  for  redundant  information.  Let  k 
be  the  number  of  active  information  functions,  i.e.,  the 
functions  defining  p(x),  and  consider  the  possibility  that 
we  have  selected  too  many  functions.  Let  a(x)  be  p(x)  with 
one  of  the  active  functions  removed.  If  J(p(x),q(x))  is 
close  to  zero,  where  "close"  is  defined  by  the  user,  then 
we  lose  very  little  information  in  dropping  the  subject 
function.  For  small  divergence,  we  replace  p(x)  with  q(x) 
and  iterate  to  test  each  function  for  deletion.  This 
removal  step  was  implemented  as  a  seoarate  subroutine 
(THROUT)  because  it  is  used  in  method  three  and  may  be 
applied  to  method  one  for  better  results.  THROUT  con¬ 
tributed  appreciably  to  the  excellent  results  obtained  with 
method  two. 

Results 

Method  two  uses  an  approach  which  is  conceptually 
similar  to  method  one  but  with  a  different  measure  of  fit, 
i.r.,  divergence.  Divergence  is  an  accepted  information 
r.e.tvuro  which  evenly  weights  the  data  points  and  follows 
the  Information  theoretic  thrust  of  the  dissertation.  One 
'■  st  notice,  however,  that  the  selected  active  set  is  not 
a  : r  intend  to  bo  the  single  best  set  in  the  divergence 


sense.  This  is  not  of  great  concern  because  the  selected 
set  must  produce  an  acceptably  small  divergence  and  thus  pro¬ 
duce  an  acceptable  fit  to  the  sample.  Method  two  may  reach 
a  "small"  divergence  before  considering  the  single  best  set 
of  functions.  Method  one  has  a  slight  advantage  in  this 
respect  because  method  one  uses  "leaps  and  bounds"  linear 
regression  which  considers  more  possible  combinations  of 
functions.  With  either  method,  the  only  way  to  ensure  that 
the  single  best  set  is  selected  is  to  check  all  possible 
combinations  of  potential  functions.  Such  a  procedure  is 
not  practical,  and  results  with  both  methods  indicate  that 
such  a  procedure  is  not  necessary. 

Table  VII. I  compares  the  results  lor  methods  one 
and  two  when  applied  to  two  sample  densities  from  Chapter 
VI  (normal  and  beta)  and  a  third  sample  from  a  gamma  dis¬ 
tribution.  The  function  symbols  are  defined  in  Table 
VI. Ill  of  the  last  chapter.  Method  two  selected  the  cor¬ 
rect  analytic  functions  for  the  normal  and  gamma  samples. 

An  excellent  aoproximation  was  produced  for  the  beta 
although  the  desired  analytic  functions  were  not  selected. 
Method  one  results  were  statistic  dependent  but  were 
generally  more  oriented  to  the  sample  distribution.  Addi¬ 
tional  samples  from  the  same  distribution  families  were 
tested  with  similar,  although  not  exact,  results.  In  all 
tests,  if  either  method  chose  functions  other  than  the  ana¬ 
lytic  functions,  the  resulting  approximations  were  still 
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exceptionally  close  to  sample  and  analytic  distributions. 

Of  course  when  the  analytic  functions  were  selected,  the 
fit  to  the  analytic  was  exact.  A  fourth  instructive  com¬ 
parison  of  the  methods  is  made  in  Chapter  IX  for  a  simula¬ 
tion  application. 

Method  two  requires  specification  of  the  expected 
values  for  the  potential  information  functions  and  defini¬ 
tion  of  the  approximation  bounds,  i.e.,  the  interval  over 
which  the  entropy  approximation  will  apply.  For  the  given 
examples,  expected  values  were  calculated  via  quadrature. 
The  bounds  are  specified  by  the  analyst  and  may  be  based  on 
knowledge  of  the  unknown  distribution,  quadrature  results, 
simulation  results,  or  the  random  sample.  The  analyst 
usually  acquires  such  bounds  in  generation  of  the  expected 
values.  The  interval  [x^x^]  from  the  sample  will  suffice 
if  the  expected  (or  average)  values  are  calculated  on  this 
interval . 

The  iterations  of  method  two  for  the  normal  sample 
are  shown  in  Table  VII. II.  The  first  three  iterations  of 
Table  VII. II  represent  the  function  addition  phase  which 
selects  F2,F6,F1.  We  stop  at  this  point  because  the  diver¬ 
gence  in  adding  Fl  changes  very  little  from  the  previous 
iterations,  and,  in  fact,  increases  slightly.  The  analyst 
may  have  preferred  to  stop  at  F2,F6.  Notice  that  in  the 
function  additions  phase  we  are  comparing  entropy  to  sample 
densities.  The  entropy  densities  are  calculated  on  the 
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TABLE  VII. II 


DIVERGENCE  METHOD  APPLIED  TO  NORMAL  SAMPLE 


A. 

Function  Addition  on 

=  [6.07,  13. 

,35] 

Iteration 

Function  Set 

Divergence 

1 

F  2 

J(P2 

(x)  ,f  (x) )  = 

.02257350 

2 

F  2  ,  F  6 

J(p26 

(x) , f (x) )  = 

.02252025 

3 

F2,Ff , F 1 

J(p261 

(  X  )  ,  f  ( X  ) )  - 

. 02252026 

B.  Function  Deletion  on  [t-4c,  u  4  o  ]  =  [4.34,  15.66] 


1 1  < '  r  .i  t :  on 

Divergence 

Actio 

4 

J  ^2fl  ^  ^  ^ 

r-H 

O 

CO 

li 

Retain 

F  2 

T 

J (p261 (X) ,P21 (x) J 

=  1  (-20) 

Delete 

F  6 

>) 

J (p261 (x) ,p2  (x) ) 

=  2  (-7) 

Delete 

FI 

C.  Active  Set  =  F2 


NOTE:  Function  symbols  defined  in  Table  VI. Ill 


interval  {y-4o,p+4a]  whereas  the  sample  is  defined  on  the 
subset  [x^/X^j];  the  divergence  comparison  is  thus  made  on 
the  smaller  bound.  We  return  to  [p-4o,p+4o]  for  function 
deletion  (subroutine  THROUT)  which  compares  entropy  densi¬ 
ties.  The  final  active  set  is  F2  as  desired. 

Experimentation,  as  exemplified  in  Table  VII. I, 
indicates  that  the  divergence  approach  is  more  likely  to 
correctly  identify  the  analytic  functions  and  is  less 
sample  sensitive  than  method  one.  Howe'er,  method  two  is 
still  sample  sensitive  as  seen  with  the  beta  approximation. 
Method  two  selected  functions  F2,F5,F9  to  produce  a  diver¬ 
gence  J ( f  (x)  , p (x) )=. 022 3359 .  The  correct  analytic  func¬ 
tions  are  F3  and  F5.  We  calculated  the  divergence  between 
the  analytic  density,  q(x),  and  the  sample,  f (x)  to  find 
J(f  (x)  ,q  (x) )  =  . 0249148.  Thus  method  two  chose  functions 
that  provides  a  closer  fit  to  the  data,  in  the  divergence 
sense,  than  if  the  correct  analytic  functions  had  been 
chosen.  Table  VII. Ill  provides  actual  values  of  the 
respective  densities  at  17  of  the  32  quadrature  points  used 
in  the  beta  example.  Sampling  error  and  numerical  differen¬ 
tiation  account  for  the  discrepancy  between  sample  and  ana¬ 
lytic  values. 

Methods  one  and  two  are  quite  similar  in  structure 
and  results.  The  nonlinear  regression  approach  of  method 
two  appears  to  be  less  sample  sensitive  than  method  one, 
i.e.,  method  two  is  more  likely  to  identify  the  underlying 


TABLE  VII. Ill 


SAMPLE 

,  ENTROPY, 

AND  ANALYTIC 

COMPARISON  FOR  BETA 

SAMPLE 

T 

X(I) 

Sample 

Density 

Method  2  Entropy 
Density 

Analytic 

Density 

1 

.  17069 

.26205 

.10144 

.08248 

2 

.18411 

.26672 

.11593 

.10183 

3 

.21237 

.27656 

.15229 

.15068 

4 

.25442 

. 30384 

.22379 

.24558 

C 

o 

r* 

CO 

o 

n 

. 36929 

. 35396 

.40673 

6 

.  37318 

.  51643 

. 57543 

.65131 

- 

.44  545 

.  97566 

. 91564 

.98034 

8 

. 72284 

1 .  300  94 

1 . 36148 

1 . 36394 

a 

. 60244 

1.80312 

1.82053 

1.73851 

;  r 

.68131 

1  .61865 

2 .13328 

2.01572 

]  1 

.7  564  9 

2.4  3046 

2 . 16621 

2.  10S42 

■  n 

a.  £. 

.  82519 

1  .  54359 

1 . 91065 

1  .  964  54 

- 

. 8S484 

1.62170 

1.47946 

1.59560 

14 

. 93323 

.75704 

1.01420 

1.08535 

15 

.  96885 

.44177 

.60802 

.57145 

16 

.98949 

.  38228 

.30344 

.20361 

17 

. 99430 

.  36861 

.20920 

.11200 

122 


analytic  distribution.  Method  two  is  less  cumbersome  to 
use.  However,  method  one  may  be  preferred  when  a  test  of 
hypotheses,  or  a  confidence  bound,  about  the  accuracy  of 
the  entropy  approximation  is  desired.  Method  one  offers 
flexibility  in  choice  of  statistic  and  is  a  more  tradi¬ 
tional  approach.  The  key  point  is  that  both  methods  will 
produce  excellent  approximations,  given  a  workable  poten¬ 
tial  set.  The  choice  of  method  is  at  the  analyst’s  discre 
tion. 

The  methods  produce  excellent  results  but  share  a 
common  disadvantage.  Both  methods  require  a  random  sample 
of  the  unknown  distribution  and  both  involve  numerical  dif 
f erentiation .  Sampling  and  differentiation  errors  are  two 
reasons  for  failing  to  explicitly  identify  the  underlying 
analytic  distribution.  Method  three  provides  a  viable 
alternative  which  avoids  these  error  sources. 
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Chapter  VIII .  Active  Set  Selection — 


Method  Three  (Expected  Values) 


Introduction 

This  chapter  presents  a  third  method  for  selecting 
the  best  active  set  of  information  functions  from  a  pre¬ 
defined  potential  set.  Methods  one  and  two,  though  effec¬ 
tive,  are  subject  to  sampling  and  numerical  differentiation 
errors.  Method  one  (linear  regression)  produces  an  approxi¬ 
mation,  p(x) ,  that  compromises  between  sample  and  analytic 
distributions  with  a  tendency  to  match  the  sample.  Method 
two  (divergence)  approximations  produce  similar  compromises 
but  with  a  strong  tendency  toward  the  underlying  analytic. 
Method  three  (expected  values)  concentrates  on  the  under¬ 
lying  analytic  distribution  from  which  the  sample  is 
generated.  Method  three,  like  methods  one  and  two,  requires 
the  expected  values  of  all  information  functions  in  the 
potential  set  and  definition  of  the  interval  of  approxima¬ 
tion,  Ia,b].  However,  method  three  does  not  use  a  sample 
cumulative  or  density  and  consequently,  is  faster  and  less 
complicated  than  previous  methods. 

The  expected  values  method  is  based  on  the  premise 
that  the  expected  values  of  the  potential  information  func¬ 
tions  communicate  sufficient  information  to  accurately 
approximate  the  unknown  distribution.  Let  f (x)  represent 
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the  unknown  underlying,  analytic  density,  and  let  <G>  be 

m 

the  vector  of  expected  values  for  the  m  potential  functions 
where  <G>  = [<g, > , <g,> , . . . <g  > ] T  and  <g  .  >  =  / b g  .  (x) f (x)  dx  . 

Our  information,  the  <G>m  vector,  is  generated  by  the 
unknown  analytic  distribution,  i.e.,  via  quadrature,  simu¬ 
lation  or  averages  since  f (x)  is  unknown.  An  accurate 
entropy  approximation  to  f (x)  must  generate  an  accurate 
approximation  to  <G  .  For  example,  if  the  entire  poten¬ 
tial  set  is  included  in  p(x),  i.e., 

p(>:)  =  exp  [  -  ’*  q-  1  j  a j  (x)  -  .  .  .  -  'l^c  (>:)], 

then  p (x)  will  generate  <G  >  exactly.  Now  assume  that 

m  J 

f  (x)  is  -j  normal  distribution.  We  know  that 

p(x)  -  exp  t-'o"  ?i  (*)  -  >2<52  (x)  ^  “  exp  [ -  Ap- > ^x- X2x  : ) 

is  the  unique  entropy  characterization  of  the  normal,  and 

n ( x )  will  generate  the  same  <G >  vector  as  f  (x) .  In  this 

m 

normal  example,  <-g^ '  through  <g^>  represent  redundant 
information.  Jaynes  states  (Ref  45)  and  experimentation 
confirms  that  redundant  information  is  eliminated  from  the 
entropy  density,  i.e.,  solution  of  the  m  constraint  equa¬ 
tions  in  oui  normal  example  will  result  in  =0. 

r  3  4  m 

.,uch  a  result  is  predicted  by  our  uniqueness  theorems  of 
Chauier  IV. 

We  thus  define  the  active  set  of  information  func- 
t j  ; : i s  to  be  the  minimum  sot  of  potential  functions  that 
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acceptably  reproduces  <G>m>  This  approach  again  empha¬ 
sizes  the  importance  of  defining  a  large,  flexible  poten¬ 
tial  set,  as  discussed  in  Chapter  V,  so  that  sufficient 
information  about  the  unknown  density  is  communicated. 

Due  to  numerical  difficulties,  we  cannot  in  general  solve 
the  m  nonlinear  constraints,  for  a  large  m,  to  find  the 
unique  l^'s,  j=0,l,...m.  Consequently,  method  three  builds 
an  active  set  by  progressively  fitting  the  <G>m  vector  and 
then  checking  for  redundant  functions.  The  approach  is 
similar  to  the  regression  tactics  of  previous  chapters. 

Selection  Procedure 

Method  three  is  an  iterative  procedure  which  we 
decompose  into  the  following  steps: 

1.  Specify  (a,b]  and  calculate  the  expected  value 
vector,  <G^mr  for  the  m  dimensional  potential  set.  The 
<G>m  vector  is  part  of  the  assumed  or  "given"  data.  As  in 
previous  methods,  we  include  data  collection  as  a  pro¬ 
cedural  step. 

2.  Use  the  expected  values  in  the  appropriate 
constraint  equations  to  find  the  m  entropy  densities 

p j ( x )  =  exp  [— A0— A jgj  tx) ] ,  j  =  l,2,...m, 

where  m  is  the  number  of  functions  in  the  potential  set. 
Thus,  each  p^ (x)  contains  one  information  function  on  the 
first  iteration. 
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3.  Use  each  density  in  step  two  to  produce  <5*  , 

an  estimate  to  <G>  ,  and  measure  estimation  error.  For 

m 


m 


each  entropy  density  we  use  quadrature  to  generate  <G> 

where  <5>m= [<g1> ,<g?> , . . .<gm>]T  and  <gk>=/^  g^ (x) p^ (x)  dx. 

We  then  calculate  the  error  of  estimation,  M.2,  where 

1 

M.2  is  the  sum  of  errors  squared; 


m 

M.2  =  I  (<g. >-<g. >) 2 ,  j=l , 2 , . . .m. 
J  i=l  1  1 


4.  Select  the  information  function  which  induces 

the  best  approximation  to  <G>  ,  i.e.,  Dick  the  minimum 

m 

_j 2  .  This  information  function  becomes  part  of  the  active 
set  and  thus  part  of  the  final  approximation,  p(x). 

5.  Check  stopping  criteria.  If  M2<_EPS,  where  EPS 
is  a  predefined  stopping  value,  then  we  have  defined  an 
effective  active  set  and  may  proceed  to  step  six.  If 

M  'EPS  then  we  iterate  to  find  the  next  best  function  to 
add  to  the  active  set.  For  the  second  iteration,  steps 
,  3,  4,  and  5  are  repeated  for  m-1  entropy  densities  where 
;  <>:)  =  exp  [-'  0-?-hgh  (x! -Aj9j  <x)  ]  ,  j  =  l,2,...m,  j#h,  where 

h  is  the  active  information  function.  This  procedure  con- 
t ; nuc -  tor  m-2  densities,  m-3  densities,  etc.;  that  is,  the 
•.ctive  set  arov.-s  until  M'<_EPS  or  until  K  functions  are 
active  (we  use  K=6) . 

T  we  exceed  K  functions  without  producing  a  suf¬ 
ficiently  small  ri?  then  we  have  reached  an  error  condition. 


£ 
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We  must  assume  an  insufficient  potential  set  and  consider 
additional  potential  functions.  Collection  of  a  random 
sample  to  prodr.-e  a  frequency  histogram  or  a  numerical 
approximation  to  the  unknown  density  may  provide  insight  at 
this  point.  An  example  of  an  insufficient  potential  set 
was  given  in  Chapter  V  for  a  distribution  that  resembled 
the  double  exponential.  Again,  the  potential  set  of 
Chapter  V  should  provide  sufficient  information  for  many 
characterizations  on  (a,b]  as  the  results  section  of  this 
chapter  will  show.  Once  we  obtain  a  sufficiently  small  M2 , 
we  consider  the  elimination  of  unnecessary  functions. 

6.  Eliminate  redundant  information  functions, 
i.e. ,  apply  subroutine  THROUT  of  Chapter  VII.  THROUT 
eliminates  functions  from  the  active  set,  one  function  at 
a  time,  and  evaluates  the  divergence  between  p(x)  with  the 
active  set  and  p(x)  with  one  less  function.  If  the  diver¬ 
gence  is  near  zero  ("near"  is  defined  by  the  analyst)  then 
the  subject  function  may  be  deleted  from  the  active  set. 
This  function  removal  step  is  repeated  until  one  complete 
pass  through  the  active  set  is  accomplished  without  a  func¬ 
tion  deletion.  Active  set  selection  is  then  complete. 

Results 

Method  three  was  tested  by  generating  <G>m  vectors 
for  known  distributions,  producing  the  entropy  approxima¬ 
tion,  and  comparing  the  two  densities.  Method  three 
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consistently  identified  the  desired  analytic  functions, 
when  such  functions  were  elements  of  the  potential  set,  and 
exactly  characterized  the  analytic  densities.  The  poten¬ 
tial  set  of  Chapters  V,  VI,  and  VII  was  used  for  our  test¬ 
ing  and  is  repeated  in  Table  VIII. I  for  convenience. 

Table  VIII. II  presents  representative  test  results.  Tne 
normal,  beta  (skewed  right)  and  gamma  (skew’ed  left)  dis¬ 
tributions  of  previous  chapters  are  shown  as  w’ell  as  six 
additional  distributions.  A  tenth  example  is  given  in 
Chapter  IX  for  a  simulation  output  distribution.  Graphs 
are  not  shown  for  most  of  the  Table  VIII. II  distributions 
because  the  approximation  errors  are  very  small,  i.e., 

sup | p (x) -f  (x)  I <10  u .  We  discuss  the  examples  to  demonstrate 

X 

the  strength  of  method  three. 

TABLE  VI I I. I 

POTENTIAL  INFORMATION  FUNCTIONS 


'  ’ymbol 

Information 

Function 

Symbol 

Information 

Function 

Symbol 

Information 

Function 

FI 

(x-u)/o 

F4 

In  (x-a) 

F7 

((x-p)/o)J 

F2 

( (X— J  /?) 2 

F5 

In  (b-x) 

F8 

( (x-u)  /a) " 

F3 

lnx 

F6 

(In  (x-a) ) 2 

F9 

In  (x2+l) 

NOTE:  t=  mean;  o=  standard  deviation;  and  [a,b]  «* 


counts . 
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TABLE  VIII. II 


I,  2  Bimodal  Unknown  -  -  F2,F6,F7,F8 

_ (Ficr,  8.1) _ 

NOTE:  Method  1  results  using  Kolmogorov-Smirnov  statistic  (Chapter  VI). 


The  beta  distribution  provides  an  interesting 
example.  The  beta  density  has  the  following  two  parameter 
form: 


fb(x)  =  [C/(b-a)P+Q_1]  (x-a)P_1 (b-x)Q_1,  a^xfb 

where  C  =  r (P+Q) / ( F (P) T )  and  T(.)  is  the  gamma  function. 
Substitution  cf  P=4 . 0 ,  Q=2 . 0 ,  and  [a,b]  =  [0,l]  in  f^fx) 
produces  f  (x)  =  20.0  x  3{l-x) .  Table  VIII. Ill  displays 
the  iterations  of  method  three  in  selecting  the  active 
set  F3,F5,  i.e.,  ln(x)  and  ln(l-x).  Notice  that  func- 
tiors  F8  and  F2  were  introduced  and  ultimately  eliminated. 
The  final  entropy  density  from  Table  VIII. Ill  is 

p(x)  =  exp  [ - ?. 0 - X 3  ln(x)->5  ln(x-b)] 

=  exp  [2.9957  +  3  ln(..'  +  ln(x-l)] 

=  exp  [2.9957]  x3 (x-1) 

=  19.999999  x3 (x-1) 

which  we  ro'ond  to 

p(x)  =  20.0  x3(x-l)  =  fb<x)  exactly. 

The  normal  and  gamma  examples  produce  similar 
results.  The  gamma  density  with  shape  factor  G  and  scale 
parameter  B  is 

f  o ( x )  =  C  xG  ^exp[-x/B],  x,G,Bn0 


<r- 
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TABLE  VIII. Ill 


METHOD  THREE  RESULTS  FOR  BETA  EXAMPLE 


Iteration 

A.  Function  Addition 

Function  Set 

M2 

1 

F8 

.024418 

2 

F8,F2 

.016227 

3 

F8 ,F2 ,F5 

.005436 

4 

F8,F2,F5,F3 

1.1  (-26) 

XQ  =-2.99573 

,  A2=-5 . 5  (-13)  ,  X3— 3.0,  A 5=~1  •  0 , 

B.  Function  Deletion  (THROUT) 

Ag=l .7 (-15) 

Iteration 

Divergence 

Action 

5 

J  (p8253  (x) ,p253  (x) ) =2.  (-28) 

Delete  F8 

6 

J (p8253 (x) , p53  (x) ) =8 .  (-25) 

Delete  F2 

7 

J*PS253 ,p3  (+19) 

Retain  F5 

8 

J (p8253 (x) ,p5  (x) ) =22 . 805 

C.  Active  Set  =  F5,F3 

A0=-2. 99573,  A3=-3.0,  A5=-1.0 

Retain  F3 
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where  C  =  1.0/ [ T (G) (BG) ] .  With  G=1 . 5  and  B=2.0,  we  get 
f  (x)  =  .  398940  x *  5exp (- . 5x) 

which  is  skewed  left  with  mean  3.0  and  variance  6.0.  We 
chose  bounds,  [0,17.0],  based  on  the  random  sample  which 
was  used  in  methods  one  and  two.  Table  VIII. IV  shows  the 
gamma  progression.  We  notice  that  our  quadrature  sub¬ 
routine  produced  ’.1  =  2.9864995  and  o2  =  5. 8143277  on  the  [0,17.0] 
interval.  Clearly,  more  accurate  values  for  t  and  c2  would 
be  obtained  or.  larger  intervals.  However,  the  given  values 
ore  accurate  for  [0,17.]  and  the  entropy  approximation  is 
concerned  with  this  interval.  The  point,  is  that  expected  val 
estimates  should  be  computed  over  the  interval  of  interest 
as  we  have  done  in  our  examples.  Method  three  chose  FI 
and  F 3 ,  or  x-.. /..  and  lr.(x),  for  the  active  set.  From 
Table  VIII. IV, 


p  (x ) 


exp [ - \Q-X ^ (x-t ) /c-\ ^  In  (x) ] 

3 

exp[-X0+\^p/o]  exp  [(-'1/r)x]  x 


=  .398942  °xp[-.49999998x]x' 5 


whi-h  a 

f  (x?  . 
g 

i  ?  the 


g3in  provider  a  rather  accurate  approximation  to 
The  key  to  the  extreme  accuracy  in  all  the  examples 
fact  that  the  data,  i.e.,  the  vector,  is  accu- 

The  tire  and  money  invested  by  the  analyst  to 
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TABLE  VIII. IV 


METHOD  THREE  RESULTS  FOR  GAMMA  EXAMPLE 


Iteration 

A.  Function  Addition 

Function  Set 

M2 

1 

F  5 

1.50377 

2 

F5,F8 

.171532 

3 

F5,F8,F3 

.001014 

4 

F5 , F8 , F  3 , FI 

1.7  (-25) 

A 0=2 . 412188 , 

1^=1.205646,  A  3  =  - . 5 ,  A  5=5 . 9 { -1 1 )  , 

A  g=l . 1 (-13) 

B.  Function  Deletion  (THROUT) 

Iteration 

Divergence 

Action 

5 

J(p5831(x) ,p831(x))=2.4  (-20) 

Delete  F5 

6 

J (P5831 (x) ,p3i  (x) ) =2 . 4  (-20) 

Delete  F8 

7 

J ( p5  8  3i ( x ) , pi  (x) ) =. 3155 

Retain  F3 

8 

J (p5831  (x) ,p3  (x))  =15.10 

Retain  FI 

C.  Active  Set  =  FI,F3 
AQ=2. 412188,  1^=1.205646,  A3=-. 


5 


■w 


produce  accurate  average  values  is  well  rewarded  with 
method  three. 

Distribution  number  seven  of  Table  VIII. II,  which 
we  call  the  hyperbolic,  was  investigated  because  it  is 
similar  in  appearance  to  an  exponential  distribution  on 
a  bounded  interval.  Our  intent  was  to  see  if  method  three 
would  distinguish  between  similar  distributions.  The 
density  for  our  hyperbolic  is 

f^(x)  =  1/ I2x  In  (a)  )  ,  l/a^x<_a 
or  fh (x)  =  exp  [-In  x  +  In (f ) ] 

Thus  ln(x),  F3,  is  the  desired  analytic  information  func¬ 
tion,  and  method  three  must  produce  >Q=~ln(a2)  and 
X^=1.0  for  an  exact  fit.  Calculation  will  show  for  f^(x) 
that  >=(  r-l)  /  (2a-ln  a)  .  For  the  exponential, 

f  (x'  =  Sexp  [  —  Bx ]  ,  0<x'co 
e  — 

with  ■'x'>=-l/2.  We  wished  to  test  hyperbolic  and  exponen¬ 
tial  distributions  with  the  same  means.  We  thus  selected 
the  a  parameter  for  the  hyperbolic,  calculated  <x>,  and 
used  <x  '  to  find  the  exponential  parameter.  The  result  was 
two  similar  distributions  with  the  same  means,  though  the 
exponential  is  applied  over  a  larger  interval.  Method 
three  distinguished  between  the  two  distributions  based  on 
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respective  <G>m  vectors,  as  shown  in  TableVIII.il,  and  pro¬ 
duced  accurate  representations  once  more. 

The  uniform  and  bimodal  distributions  were  intro¬ 
duced  as  extreme  cases.  The  maximum  entropy  density  for  an 
unknown  distribution  given  only  the  interval  [a,b]  and  no 
further  information  is  the  uniform  distribution.  Method 
three,  when  given  a  <G>m  vector  from  a  uniform  distribu¬ 
tion,  should  thus  select  only  the  constant  parameter,  Xg. 
The  method  performed  perfectly. 

The  bimodal  distribution  was  taken  from  reference 
14  which  discussed  a  discrete  entropy  approach  to  develop 
density  histograms.  The  bimodal  density  is  composed  of 
two  "tent"  functions, 

f  ( x )  =  2x  01*11/2 

=  2  (1-x)  1/2<x<1 

=  2  (x-1)  l<x<3/2 

=  4 -2x  2/2<x<2 

=  0  otherwise. 

Our  continuous  entropy  approximation  procedure  was  devel¬ 
oped  for  unimodal  distributions  as  evidenced  by  our  poten¬ 
tial  function  set  (Ref  Chapter  V) .  Thus  our  potential  set 
does  not  contain  the  correct  information  functions  to  pro¬ 
vide  an  exact  fit  to  f  (x) ,  but  a  reasonable  approximation 
results.  Figure  8.1  graphs  the  analytic  and  entropy  densi¬ 
ties.  Our  continuous  entropy  approach  provides  a  density 
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motion  (Method  3) 


that  has  the  same  general  shape  as  the  histogram  approxi¬ 
mation  but  also  provides  an  analytic  form  for  the  approxi¬ 
mation  density,  p(x).  (Reference  14  contains  graphs  of  the 
histogram  approximation.) 

Measure  Sensitivity 

The  average  values  procedure  of  method  three  is 
based  on  a  fit  to  the  <G>m  vector.  The  method  progressively 
selects  the  information  functions  that  contribute  to  accom¬ 
plishing  this  fit.  Such  a  procedure  is  sensitive  to  the 

measure  of  fit,  i.e.,  M2  .  We  selected  a  least  squares 
m 

measure,  M2  =  Z  (<g . >-<g . >) 2 ,  after  experimentation  with 
i=l  1  1 

various  forms  of  this  measure.  For  successful  application, 
M2  must  measure  the  "relative"  contribution  of  an  informa¬ 
tion  function  regardless  of  the  size  of  a  particular 
expected  value.  For  example,  if  ggfxj^x1*  and  g1(x)=x 
on  interval  [50,100]  then  <gg>  will  be  much  larger  than 
<g^>.  Thus  the  contribution  of  (<g8>-<gg>)  to  M2  will  have 
more  importance  than  (<g1>-<g^>) ,  and  we  would  expect  the 
procedure  to  first  select  the  function  that  produces  the 
smallest  (<g8>-<gg>). 

Various  forms  of  ratio  tests  were  investigated  in 
an  effort  to  evenly  weight  the  information  functions,  but 
such  measures  proved  cumbersome.  The  most  straightforward 
approach  is  to  equalize  the  influence  of  information  func¬ 
tions  prior  to  testing,  i.e.,  scale  g.(x),  i=l,2,...m  such 


that  the  <gi>  are  roughly  equivalent  in  value.  Normaliza¬ 
tion  of  <G>m  is  not  sufficient  because  the  size  bias  is 
maintained.  Thus,  we  independently  scale  potential  informa¬ 
tion  functions  if  such  functions  promise  to  produce  large 
average  values  on  the  interval  of  approximation.  For 

example,  we  may  replace  moments,  x  ,  with  normalized  central 
k  k 

moments,  (x-p)  /o  .  Moments  were  the  only  information 
functions,  in  the  potential  set  of  Chapter  V,  that  required 
sea] inn  for  our  applications.  The  analyst  should  be  aware 
of  the  possible  need  for  function  scaling  for  large  approxi¬ 
mation  bounds,  [ a , b 1 . 


Summary 

The  goal  of  our  entropy  approximation  procedure  is 
to  acceptably  approximate  an  unknown  distribution  based  on 
obtainable  information.  Method  three  has  shown  that  we  can 
provide  an  extremely  accurate  characterization,  given  the 
interval  of  approximation  and  the  expected  values  of  certain 
information  functions.  The  accuracy  of  approximation 
depends  on  the  accuracy  of  <G>m  and  the  flexibility  of  the 
potential  set.  The  potential  set  of  Chapter  V  is  extremely 
productive  for  unimodal  distributions  on  a  bounded  interval. 

Method  three  has  been  implemented  as  a  FORTRAN 
comparer  subroutine,  METH3.  The  program  uses  previously 
discussed  subroutines  THROUT,  which  performs  the  function 
deletion  step,  and  ENTRCP ,  which  solves  the  constraint 
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equations  for  the  entropy  density  parameters.  METH3  was 
used  in  all  the  above  examples.  The  interval  bounds  [a,b] 
and  vector  <G>m  are  required  subroutine  inputs,  and  the 
active  set  and  p(x)  are  returned.  The  analyst  can  modify 
the  potential  function  set  by  changing  function  cards 
which  do  not  involve  the  principle  subroutines.  The  pro¬ 
gram  is  constructed  to  handle  12  information  functions  in 
the  potential  set  although  the  9  functions  of  Chapter  V 
(repeated  in  Table  VIII. I)  have  proved  sufficient  in 
experimentation . 

The  excellent  performance  of  the  expected  values 
method  is  complemented  by  its  simplicity  and  the  lack  of 
a  need  for  a  large  random  sample  from  the  unknown  distribu¬ 
tion.  However,  the  analyst  may  prefer  to  produce  a  sample 
to  make  density  and  cumulative  comparisons  as  in  methods 
one  and  two.  Such  comparisons  confirm  the  accuracy  of 
approximation  and  indicate  if  modifications  to  the  poten¬ 
tial  set  are  needed. 

As  a  final  comment,  we  review  the  purpose  of  the 
selection  procedures.  The  purpose  is  to  select  the  active 
set  of  functions  for  a  specific  approximation.  If  the 
analyst  has  already  identified  the  active  set,  i.e.,  he 
knows  the  family  of  the  unknown  distribution,  then  he  may 
circumvent  selection  procedures  and  simply  solve  the  con¬ 
straints  to  completely  specify  p(x).  If  the  analyst  can 
only  generate  expected  values  (or  averages)  for  certain 
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functions,  then  those  functions  may  compose  the  potential 
set,  and  method  three  may  be  used  to  select  the  active  set. 
If  a  random  sample  is  available  then  the  analyst  may  prefer 
methods  one  or  two.  Finally,  the  analyst  may  prefer  a 
combination  of  the  above  techniques  or  application  of  all 
three  methods. 

We  have  demonstrated  that  the  methods  work.  The 
selection  of  a  particular  method  depends  on  the  specific 
problem,  available  data,  and  data  accuracy.  Our  approach 
has  been  to  first  apply  method  three  because  it  is  the 
easiest  to  use.  However,  inaccurate  expected  values  may 
cause  unsatisfactory  results  with  method  three  as  demon¬ 
strated  in  the  "interval  arithmetic"  application  of  Chapter 
XI.  Methods  one  or  two  may  then  be  preferred. 


Chapter  IX .  Appl ication  to  Simulation 


Introduction 

Previous  chapters  have  described  an  effective  pro¬ 
cedure  for  approximating  an  unknown  distribution  and  pro¬ 
viding  a  closed  form  for  the  approximating  density,  p(x). 
The  procedure  has  broad  application  in  mathematical  and 
stochastic  analysis.  This  chapter  discusses  the  use  of 
our  procedure  to  approximate  the  output  distribution  of  a 
computer  simulation,  an  application  which  motivated  the 
development  of  our  method.  The  method  was  designed  to  sup 
port  simulation  studies  at  the  Air  Force  Flight  Dynamics 
Laboratory  (AFFDL) .  Vehicle  Synthesis  Branch. 

The  strength  and  importance  of  the  entropy  approxi 
mation  procedure  as  applied  to  simulation  will  be  demon¬ 
strated  by  considering  an  AFFDL  example  problem.  The 
general  simulation  problem  and  usual  output  characteriza¬ 
tion  approaches  are  considered,  followed  by  discussion  of 
the  entropy  approach  with  example. 

Simulation  Model 

A  computer  simulation  may  be  viewed  as  a  "black 
box"  model  which  provides  a  distribution  of  output  values 
based  on  user  specified  input  distributions.  We  consider 
a  simplified  "black  box"  model: 
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f  (x)  on  [a  ,b]- 


F(x) 


-f  (y )  on  [c,d] 


where  f (x)  is  the  density  function  of  input  random  variable 
X,  F (x)  is  the  mathematical  transformation  which  repre¬ 
sents  the  simulation,  and  f(y)  is  the  output  random  vari¬ 
able  density  function.  Notice  that  random  variables  X  and 
Y  may  be  vector  valued  although  we  consider  univariate 
input  and  output  for  now.  The  transformation  F(x)  is  known 
to  the  user  or  is  available  as  a  computer  subroutine. 

This  model  highlights  two  significant  points  about 
simulation.  First,  the  input  distribution  must  be  com¬ 
pletely  specified  by  the  simulation  user.  Input  specifica¬ 
tion  may  take  several  forms  (e.g.,  a  set  of  discrete  values 
for  X,  a  density  function  for  X,  or  a  curve  representing 
possible  values,  etc.),  but  the  input  is  specified  and  thus 
known.  Second,  the  usual  purpose  of  the  simulation  is 
evaluation  of  the  output.  The  user  may  require  a  sample 
output  distribution,  ar.  average  value  of  the  output  vari¬ 
able,  or  an  indication  of  output  sensitivity  to  input 
variations.  The  simulation,  in  general,  becomes  more  use¬ 
ful  to  the  user  as  his  degree  of  knowledge  about  the  output 
distribution  increases.  However,  simulation  output  is  a 
set  of  discrete  values.  To  best  serve  the  user,  this  set 
of  values  must  be  manipulated  to  describe  the  stochastic 
nature  of  the  output  in  terms  of  a  distribution  or  density 
function.  Thus,  the  input  is  known,  and  the  transformation 
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is  known  (or  available  as  a  subroutine) .  We  seek  an  approxi¬ 
mation  or  characterization  of  the  output  distribution. 


Output  Characterization 

How  might  an  analyst  characterize  the  output  dis¬ 
tribution  given  the  known  information  in  the  previous  model? 
The  answer  to  this  question  depends  on  the  nature  of  the 
mathematical  transformation,  F(x).  If  F(x)  is  known  and 
linear  or  mathematically  "nice,"  then  an  analytical  solu¬ 
tion  may  be  feasible.  For  example,  the  analyst  may  be  able 
to  propagate  the  input  density,  f  (x) ,  through  the  model 
using  transformation  of  variable  techniques  to  analytically 
derive  f (y)  without  computer  simulation.  One  might  con¬ 
sider  decomposition  of  F(x)  into  a  series  of  less  complex 
transformations  for  this  purpose.  For  nonlinear  transforma¬ 
tions,  linear  approximation  is  a  popular  technique  and  has 
been  successfully  applied  to  very  complex  modeling  problems. 

Research  conducted  by  Orr  (Ref  63)  provides  an 
example  of  such  an  application  to  an  extensive  antiaircraft 
artillery  simulation.  While  linear  approximation  was 
acceptably  accurate  for  several  nonlinear  portions  of  the 
simulation,  simplified  nonlinear  models  were  needed  for  a 
few  submodels.  Clearly,  the  analytical  approach  is  the  pre¬ 
ferred  method  when  possible.  However,  detailed  modeling 
of  real  world  processes  seldom  results  in  transformations 
which  are  analytically  manageable.  The  most  frequent 
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approach  when  difficult  transformations  are  involved  is 
Monte  Carlo  simulation. 

Monte  Carlo  simulation,  while  a  potentially  accu¬ 
rate  and  reliable  means  to  model  real  world  systems,  pos¬ 
sesses  two  notable  disadvantages.  First,  a  large  number 
of  simulation  trials  are  needed  to  provide  adequate  infor¬ 
mation  about  the  output  distribution.  Thus,  computational 

A 

expense  may  restrict  Monte  Carlo  analysis,  particularly  in 
terms  of  output  sensitivity  to  input  variations.  Secondly, 
the  output  of  a  Monte  Carlo  simulation  is  a  random  sample. 

A  suitable  method  of  distribution  approximation  must  be 
applied  to  this  sample  to  derive  meaningful  stochastic 
information.  The  entropy  procedure  provides  a  distribu- 

V 

tion  approximation  and  can  reduce  the  number  of  trials  or 
simulation  calls  required  for  approximation  and  for  subse¬ 
quent  sensitivity  analysis. 

Entropy  Approximation 

The  entropy  approximation  procedure,  when  applied 
to  computer  simulation,  provides  a  usable  and  minimally 
prejudiced  density  function,  effectively  uses  available  or 
computable  information,  and  provides  analytical  ins^j' 
that  is  not  afforded  by  Monte  Carlo  simulation.  We  discuss 
this  application  and  potential  computational  benefits  for 
sensitivity  studies. 

e 
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The  entropy  procedure  was  presented  in  Chapter  III 
and  expanded  in  subsequent  chapters.  The  simulation  output 
density,  f (y) ,  is  approximated  by  an  entropy  density  of 
the  form 

P  (y )  =  exp  [-X0  -X1  g;L  (y)  -...Xk  gk(x)  ] 

where  g^(y),  i=l,2,...k  are  information  functions.  We 
highlight  the  benefits  of  the  entropy  application  to  simula¬ 
tion  by  reviewing  the  procedural  steps. 

Select  the  Active  Set .  Three  methods  were 
described  for  selecting  the  active  set  of  k  functions  from 
a  predefined  potential  set  of  m  functions  (Chapters  VI, 

VII,  and  VIII).  Methods  one  (regression)  and  two  (diver¬ 
gence)  each  require  a  Monte  Carlo  sample  of  the  simulation 
output  to  select  the  active  set.  Since  a  Monte  Carlo  sample 
can  provide  a  numerical  approximation  of  the  output  dis¬ 
tribution,  one  may  question  the  benefit  of  proceeding  with 
the  entropy  application.  However,  the  entropy  procedure 
provides  at  least  two  advantages.  First,  an  analytical 
representation  of  the  output  density  is  provided  for  ease 
in  subsequent  analysis.  As  previously  discussed,  the  spe¬ 
cific  information  functions  that  are  selected  for  this 
representation  may  provide  further  insight,  i.e.,  if  y  and 
y2  are  selected  then  the  output  distribution  is  approxi¬ 
mately  normal.  Secondly,  the  entropy  procedure  produces  a 
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distribution  that  compromises  between  the  sample  and  true 
underlying  distributions.  Thus,  the  entropy  method  pro¬ 
vides  a  closer  fit  to  the  underlying  unknown  density  than 
a  single  Monte  Carlo  run  can  provide.  Method  three 
(expected  values)  does  not  require  a  Monte  Carlo  sample, 
unless  such  a  sample  is  needed  to  compute  the  expected 
values,  and  thus  provides  the  added  benefit  of  reduced 
simulation  calls.  (However,  we  recommend  an  initial 
Monte  Carlo  run  when  using  method  three  to  ensure  that 
the  potential  set  includes  a  sufficient  number  of  functions 
(Chapter  VIII) . ) 

Given  the  initial  Monte  Carlo  sample,  subsequent 
Monte  Carlo  runs  are  needed  only  when  the  forms  or  bounds 
of  the  input  distributions  are  changed,  i.e.,  from  one 
distribution  to  another  such  as  from  normal  to  beta. 
Variations  in  input  distributions,  such  as  changes  in  mean, 
variance  or  distribution  parameters,  are  permissible  without 
reselecting  the  active  set.  Thus,  once  the  active  set  is 
defined,  sensitivity  analysis  may  be  accomplished  by  simply 
generating  expected  values  for  the  active  set  and  solving 
for  the  specific  p(y).  A  substantial  change  of  inputs 
(e.g.,  a  change  of  bounds  or  distribution  families)  may 
drastically  affect  the  form  of  the  output  distribution, 
and  active  set  selection  should  be  reaccompl ished  to  ensure 
accurate  approximations.  The  entropy  procedure  thus  offers 
a  significant  tool  for  output  analysis  and  potential 
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savings  in  computer  time  via  a  reduced  need  for  simulation 
access . 


Generate  Expected  Values .  Appendix  A  and  Chapter 
III  introduce  numerical  quadrature  for  generation  of 
expected  values.  Consider,  once  more,  our  simplified  simu¬ 
lation  model: 


f  (x)  on  [a,b]- 


F  (x) 


-f  (y)  on  [c,d] 


From  equation  (3.5)  we  have 


<g  i(y)>=M  I  W  g  (F  (x  • )  )  f  (x  . )  ,  i=l,2,...k, 

1  *  j=l  j  1  3  J 

where  VT  and  x ^ ,  j=l,...M  represent  quadrature  weights  and 
points,  and  f(x^)  is  the  value  of  the  input  density  at 
Xj.  A  study  of  this  approximation  for  <g^(y)>  surfaces  two 
significant  benefits: 

1.  Only  M  function  evaluations  (or  simulation 
calls)  are  needed  to  calculate  all  k  expected  values 

i=l,2,...k  That  is,  the  same  values  of  F(Xj)  are 
used  in  all  k  expected  values.  This  fact  alone  provides  a 
significant  improvement  to  Monte  Carlo  simulation. 

2.  The  M  simulation  calls  (F(Xj),  j=l,2,...M)  may 
be  stored  and  the  simulation  input,  f(x),  modified  (the 
interval  fa,b)  must  remain  fixed)  for  subsequent  sensitivity 
analysis  without  further  access  to  the  simulation.  This 
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benefit  provides  a  notable  analytical  tool  for  sensitivity 
analysis . 

The  listed  benefits  are  particularly  significant 
for  expensive  simulations.  However,  one  must  note  that 
the  benefits  of  quadrature  are  somewhat  reduced  when  multi¬ 
dimensional  integrals  are  involved,  i.e.,  when  the  input 
distribution  is  multivariate.  Quadrature  is  an  approxima¬ 
tion  to  analytic  integration  and  the  accuracy  of  approxi¬ 
mation  depends  on  the  number  of  quadrature  points,  M. 

While  M  may  be  small  for  one  dimensional  integration  (16  to 
32  points  offer  excellent  accuracy  in  most  cases),  the 
actual  number  of  simulation  calls  for  n  dimensional  integra¬ 
tion  is  Mn .  Consequently,  the  number  of  simulation  calls 

* 

for  large  n  can  rapidly  approach  the  number  of  calls  for  a 
Monte  Carlo  simulation.  Appendix  A  discusses  multidimen¬ 
sional  quadrature  and  provides  references  As  a  rule  of 
thumb,  quadrature  is  effective  when  the  number  of  input 
variables,  n,  is  less  than  or  equal  to  four. 

A  final  point  pertains  to  specification  of  the 
interval  for  output  approximation,  [c,d].  These  bounds  must 
be  known  prior  to  application  of  the  entropy  procedure.  If 
the  analyst  knows  reasonable  limits  for  the  simulation  out¬ 
put,  then  such  limits  should  be  used.  However,  both  quadra¬ 
ture  and  Monte  Carlo  methods  of  estimating  the  expected 
values  supply  a  means  to  estimate  [c,d].  For  the  Monte 
Carlo  approach,  the  bounds  are  simply  the  minimum  and 
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maximum  values  of  the  sample  distribution.  The  same  logic 


applies  to  quadrature,  i.e.,  [c,d]  =  [min  F  (x  .  )  ,  max F  (x  . )  ] 

i  1  i  1 

where  F(x^)  is  simulation  output  at  quadrature  point  x^. 

Solve  the  Constra int  Equations .  This  step  is 
thoroughly  discussed  in  Chapter  IV  to  include  a  computer 
subroutine  for  implementation.  The  constraint  equations 
are  repeatedly  solved  in  each  of  the  active  function  selec¬ 
tion  procedures.  Once  the  active  set  is  known,  then  subse¬ 
quent  output  approximations  for  the  same  input  family  may 
be  accomplished  by  generating  new  expected  values  and 
solving  the  constraints. 

Numerical  Usefulness 

The  entropy  procedure  provides  a  minimally  preju¬ 
diced,  stochastic  representation  of  the  simulation  output. 
The  approach  is  not  dependent  on  a  specific  simulation  but 
is  geared  to  "black  box"  models.  The  numerical  usefulness 
is  discussed  in  previous  sections  and  summarized  here. 

The  entropy  procedure  provides  an  analytic  form  for 
the  simulation  output  distribution  based  on  expected  values 
of  functions.  Expected  values  may  be  calculated  from  the 
known  input  distributions  with  limited  use  of  the  simula¬ 
tion.  Simulation  calls  may  be  stored  for  subsequent  analy¬ 
sis  of  output  sensitivity  to  input  changes.  When  a  Monte 
Carlo  sample  is  needed  to  identify  the  active  set,  the  Monte 
Carlo  work  does  not  have  to  be  reaccomplished  when  input 
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parameters  are  modified  (given  fixed  input  bounds).  Addi¬ 
tionally,  the  approach  offers  an  analytic  form  for  the  out¬ 
put  density  which  is  not  provided  with  straight  Monte  Carlo 
simulation.  The  entropy  approach  thus  provides  potential 
reduction  in  simulation  calls  and  stochastic  output  repre¬ 
sentation  . 

Example  Appl ication 

An  example  application  of  the  entropy  procedure  to 
simulation  is  given  for  the  AFFDL  problem  of  Figure  9.1. 

The  problem  is  described  in  the  first  section  and  followed 
by  a  detailed  discussion  of  entropy  results  using  active 
set  selection  method  one  (linear  regression).  Methods  two 
and  three  are  then  considered  and  show  very  similar  results. 
All  three  methods  provide  accurate  approximations  to  the 
data  . 

DESIGN 
PARAMETERS 

Weight 
Thrust 
Lift  Coeff. 


Weight 

Li f  t/Drag 

Spec.  Fuel 
Consumption 


;  i 

Design  | 

(Equations!  *"  Flight  Range 

I - 1 


Design 

Equations 


Takeoff  Distance 


PERFORMANCE 

MEASURES 


Fig-  q - 1  •  AFTDL  Simulation  Model 


r- 
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Simulation  Model .  AFFDL  contracted  for  a  study  of 
aircraft  performance  measures  for  several  different  aircraft 
with  different  engines  (Ref  59) .  Managers  were  concerned 
with  the  stochastic  nature  of  performance  measures  as 
related  to  the  stochastic  behavior  of  several  design  param¬ 
eters.  The  problem  of  interest  is  graphically  displayed 
in  Figure  9.1.  Our  example  involves  the  two  performance 
measures  in  Figure  9.1  (takeoff  distance  and  flight  range) 
and  the  indicated  design  parameters  for  one  engine  (TF2 — 

Ref  59) .  Variability  in  the  aircraft  production  process, 
as  reflected  in  the  stochastic  nature  of  design  parameters, 
means  that  no  two  aircraft  will  have  the  same  values  for 
performance  measures.  The  contract  goal  was  to  predict 
performance  measure  distributions  based  on  estimated 
design  parameter  distributions.  The  approach  was  Monte 
Carlo  simulation.  Aircraft  production  was  simulated  via 
traditional  design  equations  with  the  design  parameters 
as  simulation  inputs.  Various  distributions  were  assumed 
for  the  input  parameters,  and  a  sample  cumulative  was  pro¬ 
duced  for  each  performance  measure.  Constants  in  the 
design  equations  and  input  distributions  were  altered  to 
represent  different  type  aircraft  or  engines  and  compara¬ 
tive  sample  cumulatives  were  produced. 

To  demonstrate  the  entropy  method,  we  used  the 
contractor's  design  equations  (Table  IX. I)  and  input  distri¬ 
butions,  but  we  developed  our  own  simulation  models  (one 
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TABLE  IX. I 
PERFORMANCE  MODEL 

Takeof  f 

r  k=  l1-16 

distance  e  8 . 6 5 6 * [jTsT^Tt J  *  ,25'2 

~  Wempty  4  ''fuel  +  ''payload 
P  =  density  ratio 

S  =  King  surface  area 

=  Maximum  coefficient  of  lift 

T  =  Thrust 

Range 

range  =  (-£•)  *(^)*  In  Qy) 

W  +  w  ,  .  +  w 

empty  payload  reserve 

Cruise  speed 

Specific  fuel  consumption 
Lift/drag  ratio 

*NOTE:  Constant  8.656  later  corrected  by  contractor 

to  10.81.  We  retained  8.656  in  our  examples. 


1  5  3 


for  each  performance  measure) .  The  models  were  developed 
to  provide  experimental  control  but  also  proved  useful  in 
identifying  a  minor  contractor  error  in  one  design  equation 
(see  Table  IX. I).  We  use  the  equations  as  indicated  in 
Table  IX. I  without  correction.  For  a  specific  example  we 
consider  the  input  distributions  of  Table  IX. II,  i.e., 
normal  distributions  with  bounds  [u-4o,  p+4o]  where  p  is 
the  mean  and  o  is  the  standard  deviation.  Random  samples 
of  500  takeoff  distances  and  500  range  values  were  generated 
and  stored.  We  now  apply  the  entropy  procedure  to  approxi¬ 
mate  the  performance  measure  distributions. 

Expected  Values  and  Integration  Bounds .  The  entropy 
approach  requires  estimates  of  expected  values  of  informa¬ 
tion  functions  in  the  simulation  output  space.  The  esti¬ 
mates  may  be  provided  via  average  values  (Monte  Carlo)  or 
numerical  quadrature.  Each  simulation  in  Figure  9.1 
involves  three  independent  inputs  and  a  univariate  output, 
thus  suggesting  quadrature  for  expected  values  (Ref  Appendix 
A  and  Chapter  III).  We  use  a  16  point  quadrature  formula 
for  the  triple  integration  in  our  example.  Multidimen¬ 
sional  quadrature  is  discussed  in  Appendix  A.  Table  IX. Ill 
provides  a  comparison  of  quadrature  and  average  values  for 
simulation  output  means  and  variances  with  various  samples 

and  sample  sizes.  Average  values  were  computed  as  fol- 
N 

lows:  <g(t)>  '  l  a(t.)/N  where  N  is  the  sample  size  and 
i=l  '  1 
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TABLE  IX. II 


EXAMPLE  INPUT  DISTRIBUTIONS  AND  CONSTANTS 


Input 

Distribution* 

Mean=  v* 

Variances 2 

W 

empty 

Normal 

.  7427  (  +  6) 

.702944 (+9) 

m 

X. 

Normal 

.3017  (  +  6) 

.  149810  (  +  7) 

c 

Normal 

.2920  (  +  1) 

. 324863 (-2) 

CL /CD 

Normal 

.2030  (  +  2) 

.102478 

Ct 

Normal 

.63 

.  332522  (-5) 

Constant 

Va  1  ue 

w 

r  uel 

498,000 

lbs. 

W 

payload 

390,000 

lbs . 

p 

.944 

s 

11,270 

ft 

V 

460 

kns 

w 

reserve 

40,600 

lbs . 

*NOTE:  All  distributions  bounded  on  [u-4o,  u+4c] . 
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TABLE  IX. Ill 


COMPARISON  OF  QUADRATURE  AND  AVERAGE  VALUES 


Sample 

Minimum 

Maximum 

Measure 

Size 

Mean 

Variance 

Value 

Value 

Quadrature : 

Takeoff 

16 

6492.26 

72050.87 

5104.12 

8262.58 

Range 

16 

4880.56 

14962.69 

4185.34 

5701.30 

Averages : 

Takeoff 

1000 

6506.04 

70559.08 

5605.12 

7462.57 

Takeoff 

500 

6496.71 

70648.83 

5830.01 

7473.79 

Takeof  f 

500 

6491.58 

71019.91 

5745.75 

7240.77 

Takeoff 

500 

6539.59 

70677.8 2 

5603.24 

7242.84 

Takeoff 

100 

6520.41 

81384.04 

5629.48 

* 

Takeoff 

100 

6543.82 

83514.99 

5742.07 

* 

Takeoff 

100 

6570.19 

91038.30 

5750.68 

* 

Averages : 

Range 

1000 

4880.91 

14390.92 

4492.83 

5304.45 

Range 

500 

4887.31 

14965.55 

4484.93 

5230.29 

Range 

500 

4901.04 

14915.77 

4536.23 

5418.93 

Range 

500 

4900.85 

13364.17 

4585.49 

5347.85 

Range 

100 

4865.78 

14127.65 

4563.52 

* 

Range 

100 

4947.65 

15569.14 

4661.69 

* 

Range 

100 

4998.05 

15611.30 

4623.67 

* 

♦NOTE: 

Upper 

bounds  not 

retained  on 

this  run 
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is  the  output  sample  value.  As  expected,  a  sizeable 
fluctuation  is  noted  with  sample  size.  The  average  value 
appears  to  approach  the  quadrature  value  as  N  increases. 
Haber  (Ref  34)  provides  more  insight  to  quadrature  accu¬ 
racy  . 

Table  IX. Ill  also  compares  sample  and  quadrature 
output  bounds.  These  bounds  must  be  specified  for  con¬ 
straint  solution.  For  consistency,  the  bounding  method 
should  agree  with  the  method  of  estimating  expected  values, 
i.e.,  sample  bounds  if  averages  are  used.  Notice  that  all 
sample  bounds  are  subsets  of  the  quadrature  interval.  Our 
examples  use  quadrature  bounds. 

Method  One  (Linear  Regression) .  The  entropy  pro¬ 
cedure  requires  selection  of  the  active  set  of  information 
functions  from  a  predefined  potential  set.  As  in  previous 
examples,  we  use  the  Chapter  V  potential  set  (repeated  in 
Table  IX. IV  for  reader  convenience)  with  quadrature  values 
for  p ,  o,  and  [c,d].  Selection  method  one  (Chapter  VI) 


TABLE  IX. IV 

POTENTIAL  INFORMATION  FUNCTION  SET 


Fl  =  (x-u)/o  F4  =  ln(x-c)  F7  =  ( (x-u)  /a)  3 

?2  =  ( (x-p) /-) 2  F5  =  In (d-x)  F8  =  ( (x-u) /c) u 

F3  -  In  (x)  F6  =  (In (x-c) ) 2  F9  =  ln(x2+l) 

NOTE:  u  =  mean;  o  =  standard  deviation;  [c,d]  = 

bounds . 


♦»- 
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was  applied  to  our  random  samples  of  size  500  for  takeoff 
distance  and  range  simulation  outputs.  The  regression 
results  and  statistical  measures  are  listed  in  Tables  IX. V 
and  IX. VI  for  nine  candidate  sets.  The  function  sets  were 
chosen  based  on  largest  and  second  largest  adjusted  R2  for 
a  given  set  size.  We  allowed  up  to  six  active  functions 
per  set.  Notice  from  the  tables  that  different  statistics 
imply  different  active  sets.  Chapter  VI  and  Appendix  B 
provide  guidance  in  choice  of  statistic  for  final  active 
set  selection.  We  choose  the  Anderson-Darling,  A2,  sta¬ 
tistic  because  we  seek  accuracy  in  the  distribution  tails. 
Thus,  E4=(F2,F4,F7,F8)  is  the  active  set  for  takeoff  dis¬ 
tance  characterization  and  E2=(F2,F8)  is  active  for  flight 
range.  The  best  value  for  each  statistic  is  underlined 
in  the  tables. 

Let  us  consider  the  approximation  accuracy  for  the 
takeoff  distance  example  of  Table  IX. V.  The  value  of  A2 
is  .3193.  Stephens  (Ref  76)  provides  a  table  of  critical 
values  up  to  the  15%  significance  level.  Our  null  hypo¬ 
thesis  is: 

Hq :  The  sample  distribution  comes  from  a  popula¬ 
tion  with  distribution  function  that  is 
described  by  our  entropy  approximation,  p(x). 

Stephens'  table  gives  a  critical  value  of  A2=1.610  at  the 

15%  level  and  A2=1.933  at  10%  significance.  Thus  we  reject 

Hq  if  our  calculated  value  of  A2  exceeds  the  critical  value 

at  our  chosen  significance  level.  Our  extremely  small 
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values  of  A2  for  both  takeoff  distance  and  flight  range 
indicate  an  accurate  fit  at  significance  levels  much  higher 
than  15%.  Thus,  we  have  produced  accurate  approximation 
to  the  output  data  for  both  simulations. 

Figure  9.2  presents  an  entropy  and  sample  density 
comparison  for  takeoff  distance  to  demonstrate  the  accu¬ 
racy.  We  eliminate  numerical  differentiation  noise  by  com¬ 
paring  sample  and  entropy  cumulative  distributions  in 
Figure  9.3.  Differences  between  sample  and  entropy  cumula- 
tives  are  plotted  versus  the  actual  cumulatives  because  the 
entropy  fit  is  very  accurate.  Figures  9.4  and  9.5  provide 
the  same  information  for  the  flight  range  distribution  with 
active  set  E2 .  The  four  figures  show  an  accurate  fit  to 
the  sample  and  thus  imply  an  accurate  representation  of  the 
unknown  distributions. 

Active  set  selection  method  one  uses  linear  regres¬ 
sion  and  is  thus  sample  dependent.  As  demonstrated,  the 
method  provides  an  accurate  fit  to  a  given  sample.  However 
application  of  the  method  to  a  second  sample  from  the  same 
distribution  may  result  in  selection  of  different  active 
functions.  We  wish  to  show  that  the  active  set  (and  thus 
the  entropy  density)  for  a  given  sample  will  produce  an 
acceptable  fit  to  subsequent  samples.  If  a  given  sample  is 
a  good  approximation  to  the  true  underlying  analytic  dis¬ 
tribution,  then  the  entropy  distribution  which  approximates 
this  sample  must  also  approximate  the  analytic  distribution 
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Fig.  9.2.  Takeoff  Densities,  Sample  1  (N=500) 
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In  fact,  as  indicated  in  Chapter  VI,  the  entropy  distribu¬ 
tion  is  a  compromise  between  the  sample  and  analytic  dis¬ 
tributions.  Thus,  subsequent  sample  distributions  should 
be  well  approximated  by  the  entropy  fit. 

We  demonstrate  this  prediction  by  generating  addi¬ 
tional  samples  and  comparing  the  samples  to  our  original 
entropy  cumulative.  Figures  9.6  and  9.7  provide  cumulative 
comparisons  for  a  second  and  third  sample  (again  500 
deviates  each)  from  the  unknown  takeoff  distance  distribu¬ 
tion.  Figure  9.8  and  9.9  provide  equivalent  comparisons 
for  the  flight  range.  We  notice  that  subsequent  samples, 
particularly  for  the  takeoff  example,  fit  one  side  or  the 
other  of  the  entropy  approximation  indicating  that  our 
original  approximation  is  indeed  a  compromise.  To  further 
test  our  accuracy  with  the  original  entropy  approximation, 
we  generate  a  fourth  and  larger  random  sample  (1000  deviates) . 
Figures  9.10  through  9.13  graph  results  for  the  cumulatives. 
As  sample  size  increases,  the  sample  distribution  approaches 
the  entropy  distribution  (particularly  in  the  tails)  again 
indicating  that  the  entropy  distribution  is  a  good  approxi¬ 
mation  to  the  unknown,  underlying  analytic  distribution. 

Method  Two  (Divergence) .  We  apply  method  two  to 
the  original  takeoff  distance  sample  data  for  comparison 
with  method  one.  Method  two  uses  the  divergence  measure 
and  is  described  in  Chapter  VII.  This  method  also  produces 


166 


TAKEOFF  DISTANCE  SAMPLE 


SAMPLE  CUMULATIVE  MINUS 
ENTROPY  CUMULATIVE 


69 


SRMPLE  CUMULATIVE  MINUS 
ENTROPY  CUMULATIVE 
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Fiq.  9.10.  Entropy  Cumulative  (Sample  1)  Compared  to  Takeoff,  Sample 


Fig.  9.11.  Cumulatives  for  Entropy  (Sample  1)  and  Takeoff,  Sample 


.13.  Cumulatives  for  Entropy  (Sample  1)  and  Range,  Sample 


a  compromise  distribution,  but  experimentation  with  known 
distributions  has  shown  that  the  divergence  measure  favors 
a  fit  to  the  analytic  distribution  whereas  method  one 
favors  the  sample.  Method  two  includes  a  function  elimina¬ 
tion  step  which  is  not  currently  part  of  method  one.  The 
elimination  step  (subroutine  THROUT  from  Chapter  VII)  con¬ 
siders  the  entropy"  approximation  and  eliminates  "redundant 
information"  functions  in  an  "information  theoretic"  sense. 

The  results  of  method  two  when  applied  to  the  sample 
takeoff  distance  data  are  shown  in  Table  IX. VII.  Notice 
that  method  two  stops  adding  functions  at  set  F2,F7,F4 
because  the  addition  of  a  fourth  function  increases  diver¬ 
gence.  The  function  deletion  phase,  iteration  6  of 
Table  IX. Vl'I,  indicates  that  little  information  is  communi¬ 
cated  via  function  F7,  and  the  analyst  may  consider  elimin¬ 
ating  F7.  The  small  value  of  also  indicates  a  less  impor¬ 
tant  function.  We  chose  to  retain  ail  three  functions  in 
our  active  set.  Before  comparing  methods  one  and  two,  we 
consider  method  three. 

Method  Three  (Expected  Values) .  Method  three  (Ref 
Chapter  VIII)  does  not  use  the  sample  distribution  to  pro¬ 
duce  the  entropy  approximation,  but  concentrates  on  a  fit 
to  the  expected  values  of  the  potential  set.  Table  IX. VIII 
presents  the  results  of  method  three  for  the  takeoff  dis¬ 
tance  example.  This  table  highlights  two  interesting 
points .  Iteration  four  shows  that  adding  either  functions 
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TABLE  IX. VI I 


DIVERGENCE  METHOD  APPLIED  TO  TAKEOFF  DISTANCE  SAMPLE 


A.  Function  Addition  on  [x(l),x(N)]  = 

[5830.,  7473 

•  8] 

Iteration 

Function  Set 

Divergence 

1 

2 

J  (p2 

(x) ,f  (x)  )  = 

.044093 

2 

2,7 

J(P27 

(X)  ,f  (x)  )  = 

.042295 

3 

2,7.4 

J (p274 

(X) ,f (X) )  = 

.038427 

4 

2, 7, 4, 6 

J(P2746 

(X)  ,f  (X)  )  = 

.038841 

where 

f (x)  is  sample  density 

B. 

Function  Deletion 

on  Quadrature  Bounds  = 

1  5104 .1 , 

8262.6] 

Iteration 

Divergence 

Action 

5 

J (p  (x)  ,p74 (x) ) 

=  652.8 

Retain  Function  2 

6 

J (p (x)  ,p24  (x) ) 

=  .00532 

Retain  Function  7 

7 

J  (p  (:;)  ,p27  (x)  ) 

=  10.05 

Retain  function  4 

where 

P  <x)  =  P274  <x) 

C.  Active  Set  =  F2.F4.F7 

X 0  =  4 .648  ,  X  2  = . 5  0  8  6  , 

X4=.2572,  X 

7=- . 01 88 

0188 


TABLE  IX. VI I I 


1 


METHOD 

THREE 

APPLIED  TO  TAKEOFF  DISTANCE 

EXAMPLE 

A.  Function  Addition 

Iteration 

Function  Set 

M 

1 

2 

.015916 

2,7 

.  003607 

3 

2,7,5 

.  000435 

A 

2, 7,5,8 

2.1  (-10) 

n  7  S  4 

9.9  (-  7) 

B. 

Function  Deletion  (THROUT) 

Iteration 

Divergence 

Action 

5 

J(p  (x ) 

,p758  (x )  )  =  107.9 

Retain 

function 

6 

J (p (x) 

, P 2 5 8  (x) )  =  . 0054G 

Retain 

function 

*7 

7 

J  (p  (x) 

,P278(x) )  =  2.326 

Retain 

function 

5 

8 

J  (p(x) 

,p275(x))  =  .000082 

Reta in 

function 

8 

'  q  =  9 . 0  4  5  ,  \  2  =  .  4  9  4  0  ,  \  =-.  3386,  X?=-.0192,  Xg=. 00084 

C.  Active  Set  F2,F5,F7,F8  or  F2,F5,F7 


F8  or  F4  will  produce  an  acceptably  small  M2 .  Thus,  more 
than  one  combination  of  functions  can  produce  an  acceptable 
approximation.  We  chose  F8.  Notice  in  iteration  eight  that 
deletion  of  F8  produces  very  little  information  loss,  and 
the  analyst  may  prefer  to  use  active  set  F2,F5,F7,  i.e., 
set  Xg=0.0.  These  points  again  indicate  the  need  for 
analyst  involvement  in  the  selection  procedure.  All  three 
selection  methods  are  designed  as  analyst  tools  and  not 
as  stand-alone  computer  programs. 

A  summary  of  results  for  the  three  methods  is  pro¬ 
vided  in  Table  IX. IX.  The  three  methods  produce  such 
close  results  that  graphs  of  the  distributions  are  inade¬ 
quate  for  distinction.  Thus,  Table  IX. X  is  included  to 
provide  a  comparison  of  cumulative  distribution  values  at 
18  data  points. 


TABLE  IX. IX 

COMPARISON  OF  THREE  ENTROPY  METHODS  FOR 
TAKEOFF  DISTANCE  DISTRIBUTION 


Method 

Approximation  Interval 

Active  Set 

Regression 

5830.0, 

7473.8 

2, 4, 7, 8 

Divergence 

5104.1, 

8262.6 

2,4,7 

Expected  Values 

5104 . 1 , 

8262.6 

2, 5, 7, 8  (2,5,7) 

8 


AD-AQ83  514 


UNCLASSIFIED 

3  o*  3  1 


AIR  FORCE  INST  OF  TECH  WRI6HT-PATTERS0N  AFB  OH  SCHOO— ETC  F/6  12/1 
CONTINUOUS  DENSITY  APPROXIMATION  ON  A  BOUNDED  INTERVAL  USING  IN— ETC(U) 
MAR  80  J  E  MILLER 

AFIT/DS/MA/80-1  NL 


■£Z1 


Summary 

This  chapter  has  discussed  the  application  of  our 
entropy  procedure  to  computer  simulation.  The  three  active 
set  selection  methods  were  used  on  a  specific  example  and 
produced  consistently  similar  results.  The  choice  of  method 
is  left  to  the  analyst;  previous  chapters  provide  guidance. 
The  entropy  method  provides  an  excellent  tool  for  distribu¬ 
tion  characterization  and  a  viable  tool  for  sensitivity 
analysis . 

The  simulation  application  was  presented  in  detail 
because  such  a  general  procedure  may  be  applied  to  numerous 
stochastic  modeling  problems.  The  method  treats  the  simula¬ 
tion  nke  a  "black  box."  Thus,  if  the  analyst  can  formulate 
his  problem  as  a  special  case  of  the  "black  box"  model  then 
he  may  apply  the  entropy  procedure. 

As  a  final  note,  the  simulation  input  distributions 
were  provided  in  our  examples  and  must  be  known  to  imple¬ 
ment  a  simulation.  However,  the  entropy  characterization 
procedure  is  also  useful  in  defining  input  distributions. 
Chan  (Ref  10)  suggests  such  an  application  for  a  special 
case  of  our  general  model.  Other  entropy  applications  are 
discussed  in  Chapter  XI. 
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Chapter  X .  Sensitivity 


Introduction 

The  following  sections  address  several  aspects  of 

sensitivity  analysis.  We  summarize  sensitivity  issues  from 

previous  chapters  on  simulation  and  active  set  selection. 

The  central  concern  of  this  chapter  is  sensitivity  of  the 

entropy  approximation,  p (x) =expl-XQ-X1g1 (x)-. . .X^g^  (x)] , 

to  errors  in  the  expected  value  vector,  <G>=  (<90>  ,<9]*  •  -  •  • 

T 

<g^>)  where  there  are  k  functions  in  the  active  set. 

(For  notational  convenience  only,  we  do  not  include  vari¬ 
able  x  in  our  “'-pected  value  symbols,  i.e.,  we  let 
<gi  (x)  :>=<gi>.  )  We  present  theoretical  developments  from 
the  literature  as  well  as  two  numerical  procedures  for 
studying  approximation  sensitivity. 

Simulation  Sensitivity 

As  R.  E.  Shannon  states, 

Sensitivity  analysis  is  one  of  the  most  important 
concepts  in  simulation  modeling.  By  this  we  mean  deter¬ 
mining  the  sensitivity  of  our  final  answers  to  the 
values  of  the  parameters  used  [Ref  70:32], 

Simulation  is  designed  to  facilitate  sensitivity  analysis 
because  the  analyst  has  complete  control  over  the  param¬ 
eters  (or  inputs)  and  can  vary  them  one  at  a  time  (or 
jointly)  to  observe  the  effect  on  simulation  output.  In 
fact,  the  AFFDL  problem  of  Chapter  IX  was  solved  via 


simulation  to  answer  a  sensitivity  question,  i.e.,  how  do 
design  measures  vary  as  input  design  parameters  change? 

Our  entropy  procedure  provides  -an  effective  tool 
for  output  comparisons  by  producing  an  accurate  description 
of  the  output  distribution.  The  inputs  may  then  be 
altered  and  a  second  entropy  approximation  generated  for 
density  or  cumulative  comparisons.  Frequently,  graphs  of 
the  resulting  output  distributions  will  answer  the  ana¬ 
lyst's  sensitivity  questions.  While  the  entropy  procedure 
provides  graphs,  it  provides  additional  insight  to  simula¬ 
tion  sensitivity.  Notice  that  the  entropy  approximation, 
p(x),  is  based  on  the  expected  value  vector  <G>,  and  changes 
in  the  inputs  cause  subsequent  changes  in  <G>.  Given  a 
<G>  vector  for  the  output  of  a  simulation  with  specified 
inputs,  we  may  study  the  sensitivity  of  p(x)  to  variations 
in  this  <G"  vector,  and  this  study  is  accomplished  without 
using  the  simulation.  Once  we  have  established  acceptable 
bounds  for  the  output  <G>  vector,  we  may  return  to  the  simu¬ 
lation  to  investigate  the  effect  on  <G>  of  varying  simula¬ 
tion  inputs.  Procedures  for  evaluating  the  sensitivity  of 
p(x)  to  vector  <G>  will  be  presented. 

Active  Set  Selection 

Chapters  VI,  VII,  and  VIII  discuss  three  methods 
for  selecting  the  active  set  of  information  functions  from 
a  predefined  potential  set.  Methods  one  (linear  regression) 


and  two  (divergence)  use  random  samples  of  the  unknown  dis¬ 
tribution  to  develop  the  active  set  and  demonstrate  sample 
dependence.  While  sample  sensitive,  the  methods  produce 
approximations  that  compromise  between  the  sample  and  true 
underlying  distributions.  The  point  of  significance  is 
that  this  compromising  quality  insures  an  adequate  fit  to 
subsequent  samples  with  the  active  set  from  a  previous 
sample.  Previous  chapters  provide  conceptual  and  experi¬ 
mental  justification.  Method  three  (expected  values) 
selects  the  active  set  based  on  expected  value  information 
and  is  sensitive  to  error  in  this  data.  The  three  methods 
thus  demonstrate  data  sensitivity.  This  data  sensitivity 
is  a  desired  property  and  enables  accurate  approximations. 

The  importance  of  specifying  a  broad  potential  set 
was  presented  in  Chapter  V  and  represents  another  form  of 
sensitivity.  If  the  potential  set  contains  the  correct 
functions,  we  can  exactly  recreate  the  unknown  analytic 
distribution  as  previously  demonstrated.  The  accuracy  of 
approximation  depends  heavily  on  specifying  enough  infor¬ 
mation  via  the  potential  set.  The  potential  set  of 
Chapter  V  provides  an  excellent  starting  point  and  proved 
quite  accurate  in  experimentation. 

Approximation  Sensitivity 

The  term  "approximation  sensitivity"  is  used  here 
to  describe  the  sensitivity  of  the  approximation  density, 
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p(x),  to  errors  in  the  expected  values  vector  <G>.  We 
assume  that  the  active  set  has  been  selected  and  the  infor¬ 
mation  functions  are  specified  and  fixed.  Our  concern 
centers  on  how  changes  in  the  expected  values  of  the  active 
set  produce  changes  in  p(x).  We  restate  the  problem  in  the 
Lagrange  formulation,  discuss  theoretic  implications,  and 
then  present  two  methods  for  studying  this  sensitivity. 

The  Problem .  The  maximum  entropy  procedure  approxi 
mates  the  unknown  density,  f (x) ,  by  a  density  of  the  form 

p ( x )  =  exp  [-X0-X1  (*> -• • •  \  gk (x) 3 

where  the  g^(x),  i=0,l,...k,  represent  our  active  set  of 
information  functions  with  gQ(x)=l.  The  lambda  vector  , 

A= (> o , xi , • • • xk>  '  identifies  the  specific  p(x)  and  is  deter 
mined  by  solving  a  system  of  nonlinear  constraint  equations 


r  ~i 

r-  k 

r—  — 

f  ( A ) 
lo 

/g  (x)  exp [ -  I  X.g.(x)]dx 
u  i=0  1  1 

<go" 

• 

k 

Lfk(A)J 

/g.  (x)  exp  [-  l  X.g.(x)]dx 
L  K  i=0  1  1  J 

where  the  <G  vector  is  provided  via  quadrature,  average 
value  approximation,  or  other  means.  The  only  unknown  is 
Thus  our  approximation  problem  is  transformed  to  a 
problem  of  selecting  a  A  vector,  based  on  a  given  <G:  , 
such  that  the  resulting  p(x)  satisfies  the  constraints. 


ri.  a. 


Assuming  a  "wise"  choice  of  information  functions,  p(x) 
will  provide  an  adequate  representation  of  f  (x) . 

Sensitivity  analysis  is  defined  as  an  evaluation  of 
the  change  in  system  output  effected  by  a  systematic  vari¬ 
ation  of  system  inputs.  Our  system  is  described  by  equa¬ 
tions  (10.1)  with  input  <G>  and  output  A  or  p(x).  We  thus 
consider  sensitivity  at  two  levels;  the  sensitivity  of  A 
to  <G>,  and  the  sensitivity  of  p(x)  to  <G^. 


Theoretical  Support .  The  A  vector  defines  the 
explicit  p(x)  for  a  particular  set  of  constraint  values, 
<G>.  The  Lagrange  multiplier  formulation  of  our  problem 
provides  some  immediate  information  on  A  sensitivity  to 
<G>.  As  discussed  by  Tribus,  Jaynes,  Crain  (Refs  82;  42; 
15),  and  others,  we  may  consider  AQ  as  a  function  of  A^, 
i=l,2,...k.  Following  Tribus'  work  with  discrete  entropy, 
we  consider  the  first  constraint,  f  (A ) =<gg >=1 ,  to  produce 


exp [A  ]  =  /exp [-  Z  A  g  (x)  )  dx 

u  i_l  i  a 


(10.2) 


or 


>0^1, ...Ak)  =  lnt/expl-  Z  A  i  (x)  ]  dx] 


Differentiation  of  (10.2)  produces 


9A-/3A  =  -<g  > 

0  m  *m 


(10.3) 
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Equation  (10.3)  states  a  linear  relationship  between  XQ 

and  A  which  is  weighted  by  <g  >.  This  tells  us  that 
m  m 

X 0 (A i , • . - A^) ,  called  the  "potential  function"  by  Jaynes  and 

Tribus,  is  most  strongly  effected  by  variations  in  the 

larger  elements  of  <G>.  We  would  thus  expect  that  q  (x) 

m 

with  large  | <g  >|  would  strongly  effect  p(x).  While  pro¬ 
viding  conceptual  insight,  equation  (10.3)  provides  little 
practical  sensitivity  information  for  our  procedure.  We 
scale  the  information  functions  in  our  application  to 
reduce  the  values  of  !<g^>|,  i=l,2,...k.  The  scaling  is 
a  numerical  convenience  but  also  enhances  the  performance 
of  the  three  active  set  selection  methods.  Let  us  pursue 

the  theoretical  relationship  between  A  and  <G' . 

T 

Given  an  1  =  ( >1 Q  f  ^  ,  .  .  .  ^)  variation  in  <G>  with 

T 

£q  =  0  (gQ(x)=l),  we  want  to  find  f =  (cQ , ^ ,  .  . . 6^)  where  6 
is  the  variation  produced  in  A .  We  again  extend  the  dis¬ 
crete  example  of  Tribus  (Ref  82).  Consider  the  p(x)  which 
produces  maximum  entropy  smax(P(x)).  Then 


As  expected,  the  entropy  is  a  function  of  the  A  and  <G> 
vectors.  If  we  consider  the  <g^>  to  be  the  independent 


variables  in  equation  (10.4),  we  find  3S  (<G> ) /9<g . >=A . , 

max  ’3  3 ' 

j=0,l,...k  or,  subsequently,  3*  j  /  3  >=oA^/3<g_.>  .  These 

equations  highlight- the  interdependence  of  the  elements  of 
A  and  <G>,  i.e.,  a  perturbation  such  as  C= (0 , . . 0, £  , 0 , . . 0) , 
Cra^0,  may  result  in  a  6  with  all  nonzero  elements.  The 
theory  provides  insight,  but  fails  to  provide  a  viable 
means  of  examining  the  sensitivity  of  A  to  <G>  and  subse¬ 
quently  of  p(x)  to  <G>. 


Sensitivity  of  A  to  <G  >.  We  are  given  a  nominal 
<G>  vector,  'G'q,  and  solve  equations  (10.1)  for  A-  such 
that  F  q)=<G>q.  From  Theorem  4.3,  if  A-  exists  then  it 
is  locally  unique,  F(A)  is  one  to  one  in  some  neighborhood 
of  Aq ,  and  F  1  exists  with  F  3(<G>q)=Aq.  Given  perturba¬ 
tion  A  ,  we  wish  to  find  vector  f  such  that  F  (."  q+  ‘ )  =<G  -  0+S  . 

The  usual  approach  to  such  a  problem  (Refs  14;  19; 
70;  80;  63)  is  to  vary  a  single  element  of  <G''q  and  calculate 
6.  Then  reset  <G'q,  vary  a  second  element  of  <G'q,  compute 
the  respective  6,  and  continue  to  iterate.  The  result  of 
this  brute  force  linear  approximation  is  an  approximation 
to  the  partial  derivatives 

SA  J 


These  partials  can  provide  an  indication  of  the  strength 
of  a  specific  information  function.  A  large  value  of 


9  X , 


’  3<v< 
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3>.j/3<g^>Q  for  one  or  more  values  of  j  indicates  that  a 
small  error  in  <g^>Q  maY  cause  a  large  change  in  A  and  thus 
p(x).  This  suggests  additional  effort  to  insure  accuracy 
for  the  subject  <g^>Q. 

The  linear  approximation  provides  useful  information 
about  A  sensitivity  and  is  easily  accomplished  with  the 
computer  programs  of  Chapter  IV.  However,  the  linear 
approach  has  a  recognized  weakness  (Refs  19;  20;  68;  80) 
in  that  constraint  coupling  has  not  been  fully  considered. 

We  must  include  simultaneous  variations  of  all  <g^>, 
i=l,2,...k,  gg(x)=l,  and  observe  the  effect  on  A.  We  extend 
the  linear  investigation. 

Define  a  k  dimensional  rectangle,  Rn(<G>Q>,  about 
vector  <G'q  (k  dimensions  versus  k+1  because  <gg>=l).  This 
rectangle  is  a  function  of  parameter  a  where  each  side  of 
the  rectangle  is  an  interval,  [<gi>0-a<gi>0 ,  <gi>0+a<gi>0) , 
i=l, 2,... k.  Thus,  a  denotes  a  confidence  in  our  estimation 
of  the  expected  values.  By  sampling  from  (<G>q)  and  com¬ 
puting  A  =  (Ag  +  f ) ,  we  may  investigate  the  shape  of  the  k  +  1 
dimensional  rectangle  that  is  generated  about  A^.  For 
example,  we  may  record  the  maximum  deviation  for  each  ele¬ 
ment  of  A Q  (i.e.,  the  maximum  value  of  6^  for  each  i)  that 
results  from  the  allowable  perturbations  of  <G>q.  "Large" 
values  of  may  suggest  a  reduction  in  a.  This  approach 
has  ootential  use  for  placing  a  bounds  on  <GNg  when  the 
maximum  allowable  6^,  i=0,l,...k,  are  known  or  hypothesized. 
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The  scheme  provides  a  starting  point  for  a  practical  attack 
on  our  central  concern;  how  do  errors  in  <G>Q  affect  the 
entropy  approximation,  p(x>? 

Sensitivity  of  p (x)  to  <G> .  The  previous  sections 
relate  sensitivity  concepts  from  the  literature  and  our 
research.  They  provide  conceptual  and  theoretic  insights. 
This  section  directly  attacks  the  sensitivity  question  and 
describes  a  practical  procedure  for  investigating  the  vari¬ 
ation  in  p(x)  due  to  errors  or  changes  in  <G>g. 

As  in  the  previous  section,  we  are  given  <G>g  from 
which  we  find  Ag  and  thus  p{x).  We  select  a  to  define 

(  <G  )  and  sample  from  ^^G^g)  to  produce  <G>=  ( 1 .  ,  <g^  > , 

T 

...<gk>)  where  <gi"  is  in  f  <9i  :>0~a<gi  "q  '  <gi>0  +  a<9i>01  ’ 
Thus,  <G>  is  composed  of  k  independent,  uniformly  distri¬ 
buted  random  variables,  <g^>,  i=l,2,...k,  and  a  constant, 
<§g>=l.  Generation  of  the  <G^  samples  is  accomplished  by 
sampling  from  k  uniform  distributions  with  the  stated  a 
bounds.  Each  <G>  vector  results  in  a  A  which  defines  a 
p(x),  i.e.,  a  perturbation  of  p(x). 

The  sample  space,  Rq  (<G>g) ,  is  specified  as  a  func¬ 
tion  of  a  and  the  expected  value  vector  <G>g.  Our  pro¬ 
cedure  is  designed  to  place  confidence  bounds  about  p(x) 
based  on  predefinition  of  the  sample  space.  We  generate 
N  samples  of  <G>  and  produce  the  corresponding  N  densities, 
p(x).  Each  p(x)  is  a  continuous  density  from  the  same 
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entropy  family  as  p(x)  and  on  the  same  approximation  inter¬ 
val.  To  bound  p(x),  we  specify  M  points  on  the  interval 
of  approximation  and  consider  the  maximum  deviations, 
above  and  below  p(x),  achieved  by  the  sample  densities. 

This  approach  specifies  an  upper  and  lower  bound  on  the 
nominal  density,  p(x),  as  a  function  of  u  for  a  given  <G>q. 

Figures  10.1,  10.2,  and  10.3  present  the  results 
of  this  procedure  when  applied  to  the  beta  distribution  of 
previous  chapters  (M=50,  N=500).  As  expected,  the  bounds 
on  p(x)  grew  as  a  increases.  Similar  results  are  shown 
in  Figures  10.4  through  10.8  for  the  normal  distribution; 
p  ( x )  =exp  [  - 1  q  -  >■  j  (  x-  •_  )2  /  r  : )  ],i  =  10.,  c:=2.  For  the  normal 
example  we  allowed  <  (x-u ) 2 /c 2 >  to  vary.  This  approach  is 
effective  for  determining  a  reasonable  :»  bound  on  the  <G>0 
vector,  i.e.,  for  specifying  an  accuracy  bound  on  the  data. 
For  the  beta  example  we  conclude  that  an  a  greater  than 
.05  produces  unsatisfactory  approximation  error.  Other 
analysts  may  be  more  or  less  tolerant.  The  normal  example 
demonstrates  less  sensitivity.  We  must  insure  that  suffi¬ 
cient  emphasis  is  placed  on  data  collection  and  calculation 
of  <G "0  to  accomplish  the  desired  level  of  data  accuracy. 

The  above  sensitivity  model  offers  a  practical 
means  of  evaluating  the  sensitivity  of  p(x).  The  results 
will  depend  or.  the  form  of  p(x),  i.e.,  the  information  func¬ 
tions  in  p(x),  and  the  values  of  <G>q  and  a.  Thus,  we 
. nnnot  state  aeneral  sensitivity  results  that  pertain  to 
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10.2.  Beta  Approximation  Bounds,  Alpha 
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Fig.  10.5.  Normal  Approximation  Bounds,  Alpha 


Fig.  10.7.  Normal  Approximation  Bounds,  Alpha 


all  approximations  or  to  all  problems.  Instead,  we  pro¬ 
vide  a  procedure  which  the  analyst  may  use  on  his  specific 
problem.  Sensitivity  conclusions  derived  from  this  pro¬ 
cedure  will  be  subjective.  The  procedure  is  easy  to  imple¬ 
ment  and,  once  <G>q  is  specified,  does  not  require  addi¬ 
tional  access  to  the  unknown  distribution.  The  entropy 
approach  has  thus  provided  sensitivity  insight  that  other 
characterization  procedures  lack. 

Various  modifications  of  our  sensitivity  model  are 

feasible.  For  example,  the  analyst  may  consider  maximum 

approximation  error,  i.e.,  max | p (x ) -p (x ) | .  He  may  then 

x 

select  a  bounds  that  insure  max | p (x) -p (x) |  less  than  some 

x 

epsilon.  Secondly,  the  effect  of  a  single  expected  value 
<g  >,  may  be  evaluated  by  using  the  same  approach  but  with 
other  elements  of  <G>0  fixed.  A  third  variation  results  if 
additional  information  exists  about  the  accuracy  of  <G>Q. 

For  example,  we  have  assumed  a  uniform  error  for  each  ele¬ 
ment  of  <G'q.  The  analyst  may  know  that  the  error  is  better 
approximated  by  another  distribution.  The  above  procedure 
may  still  be  used  but  with  <5>  vectors  produced  from  the 
known  error  distribution.  Finally,  the  analyst  may  prefer 
to  compare  entropy  cumulative  distributions,  P(x)  and  P(x), 
versus  densities.  The  same  sensitivity  model  may  be  used 
but  with  bounds  now  produced  for  P(x).  We  consider  a  second 
model  for  sensitivity  investigations  in  the  next  section. 
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Entropy  Approximation  for 
Sensitivity  Measures 

A  procedure  to  investigate  the  sensitivity  of  p(x) 
to  error  in  <G>q  has  been  presented.  We  use  this  sensitiv¬ 
ity  model,  information  theory  concepts,  and  our  entropy 
approximation  procedure  to  develop  a  second  sensitivity 
model.  The  second  model  demonstrates  another  application  of 
our  entropy  approximation  procedure. 

In  previous  sections  we  defined  the  given  data, 

<0*0,  which  produces  entropy  approximation  p(x).  A  pertur¬ 
bation  of  <G'q  produces  vector  <G'-  which  results  in  a 
second  density,  p(x).  How  may  we  "measure"  the  variation 
in  p(x)  due  to  the  perturbation  in  <0*0?  We  have 

4 

k 

p(x)  =  exD  f-  I  •  g  ■  (x)  ]  and 
i=0  1  1 

k 

p  (x)  =  exp  !-  I  6  .  g.  (x)  ] 
i=0  1  1 

where  £^  =  \^+i^  and  <g^*=<g^>Q+? ^ ,  i=0,l,...k;  £q=0.  A  use¬ 
ful  measure  of  variation  between  densities  is  divergence, 
J(p(x),p(x))  (Chapters  II  and  VII).  Thus, 

J  (p(x)  ,p(x)  )  =  /  [p  (x) -p  (x)  ]  In  [p(x)/p(x)3  dx 

=  /  [p  (x)  -p  (x)  ]  I  —  r  X  ±  gi  (x)  +  ffigi(x)J  dx 
=  /I  i)gi  (x)p  (x)  dx  -  /:  (£i-Xi)gi  (x)p(x)  dx 
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=  ^  ~  ^  i~>i i) <gi> 


=  r  (S  )  (<g.>n  -  <g  .  >  ) 

i  i  0 

-  I(Xi+ti-X.)(<gi>0-<g.>0-?.) 

J (p (X) ,p  (x)  )  »Z(«i)(-Ci)  =  -STC  (10.5) 


where  all  summations  are  for  i=0  to  k.  Equation  10.5 
allows  rapid  computation  of  divergence  and  thus  a  measure 
of  information  loss  when  p(x)  is  replaced  by  p(x). 

Combining  the  divergence  measure  with  the  concepts 
of  sensitivity  model  one,  we  create  a  second  sensitivity 
model  in  the  form  of  a  simulation: 


<G> - ►  H(<G>) 


J (p (x)  ,p  (x)  ) 


=  v 


Vector  <G>  is  an  independent,  multivariate,  uniformly  dis¬ 
tributed  input  random  vector  (that  depends  on  a ) ,  and 
J(p(x),p(x))  is  the  univariate  simulation  output. 

Chapter  IX  discussed  the  use  of  our  entropy  approximation 
procedure  for  simulations  of  this  form.  Given  a  which 
defines  our  input  distribution,  we  wish  to  determine  the 
sensitivity  of  p(x)  in  terms  of  the  measure  J (p  (x) ,p (x) ) . 
Application  of  our  entropy  procedure  to  the  output  of 
model  two  provides  a  complete  representation  of  the  density 
of  J (p (x) ,p (x) ) ,  Pj (y ) ,  for  the  given  a.  Knowledge  of  Pj(y) 
enables  the  development  of  confidence  bounds  and  statistical 
statements  about  the  divergence,  i.e.,  the  probability  that 
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divergence  is  less  than  Q  for  the  given  a  is  pT(y)  dy. 
Since  divergence  measures  variation  in  p(x),  then  diver¬ 
gence  is  a  viable  sensitivity  measure.  Thus,  the  simula¬ 
tion  approach  has  provided  a  means  to  numerically  quantify 
the  sensitivity  question. 

As  an  example,  the  model  was  used  for  the  beta  dis¬ 
tribution  or  previous  sections  with  a=.l.  We  generated  500 
sample  input  vectors,  < G> ,  and  calculated  the  correspond¬ 
ing  divergence  values.  Method  three  (Chapter  VIII)  was 
applied  for  the  entropy  characterization  of  p  ( y >  .  Figure 
10.9  shows  the  entropy  and  sample  divergence  densities. 

The  sample  density  was  computed  by  numerical  differentia¬ 
tion  of  the  sample  cumulative.  With  p„(y)  known,  wa  con¬ 
sider  the  impact  or  errors  in  <G>q?  i.e.,  given  that  <G>q 
producer  p(x)  and  <G>  produces  p(x),  the  probability  that 
the  divergence  between  p(x)  and  p(x)  is  less  than  .05  for 
all  <G>  in  is  /g'0^Pj(y)  dy.  For  our  example, 

Prob  (J£.05)  =  .608  and  Prob  ( J<.  5 )  = .  983  where  Prob  ( J<cc)  =1 . 
Experimentation  has  shown  that  a  divergence  of  .05  produces 
an  "acceptable"  fit  between  p(x)  and  p(x).  For  our  example, 
however,  we  have  a  40%  chance  of  exceeding  a  .05  diver¬ 
gence  with  a  of  .1.  As  with  the  first  sensitivity  model, 
we  conclude  that  a=.l  is  too  large. 

The  sensitivity  model  was  described  in  terms  of  the 
divergence  measure.  Divergence  provides  an  excellent  rela¬ 
tive  measure,  and  we  know  that  divergence  "near"  zero 
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indicates  "small"  information  exchange.  We  may  readily 
determine  which  elements  of  <G>q  are  most  influential  by 
systematically  varying  a  single  element  and  calculating 
divergence.  The  element  which  produces  the  largest  diver¬ 
gence  is  the  most  influential.  We  may  also  compare  diver¬ 
gence  densities  or  probability  statements  for  different 
values  of  n.  For  example,  we  may  require  Prob  ( J<_.  05 )  >  .  90 
and  experiment  to  find  the  appropriate  a.  The  weakness  in 
our  model  is  that  we  cannot  define  a  statistical  meaning 
for  a  given  value  of  divergence,  i.e.,  is  J  (p  (>:)  , p  (x)  )  =  .  1 
an  acceptable  error?  However,  the  sensitivity  model  was 
designed  for  flexibility  and  provides  an  alternative. 

Divergence  may  be  replaced  with  mere  popular  mea¬ 
sures  m  the  sensitivity  model.  The  goodness  of  fit  mea¬ 
sures  of  Chapter  VI  and  Appendix  B  (Kclmogorov-Smirnov , 
etc.)  are  examples  of  measures  that  provide  better  sta¬ 
tistical  quantification  cf  sensitivity  information.  We 
consider  the  Kolmogorov-Smirnov  statistic  as  an  example, 
n-sun 1  C.,  (x)  -P  (x )  j  where  C.,(x)  represents  the  sample  cumula 

X 

tive  of  size  N  for  the  unknown  distribution.  The  D  sta¬ 
tistic  may  be  used  to  test  the  hypothesis  that  C^.lx)  was 
taken  from  distribution  P(x)  where  P(x)  is  the  entropy 
cumulative.  Thus,  we  are  given  <G''Q  and  sample  (x)  from 
an  unknown  distribution.  Vector  <G'>q  produces  the  approxi 
nation  density  p(x)  and  subsequently  P(x). 


We  define 


Vector  <GS  produces  p{x),  and  we  may  calculate  D (C  (x) ,P (x) ) 
and  D  (C  (x),P  (x)  )  .  Our  sensitivity  model  is  as  follows: 

D  (Cn(x)  ,P  (x)  )  =  y 

We  may  again  apply  the  entropy  approximation  procedure  to 
produce  the  density  pD (y) .  Thus,  for  a  given  sample,  C  (x) , 
and  a  given  a,  we  are  able  to  make  significant  probability 
statements,  i.e.,  Prob  (D_fQ)  =f®  pQ  (y)  dy.  If  Q  represents  the 
critical  value  of  the  D  statistic  for  a  given  significance 
level,  then  we  have  calculated  the  probability  of  not 
exceeding  Q  for  all  <G>  which  are  elements  of  Ra(<G>p). 

We  may  thus  relate  .he  error  in  <G >Q ,  determined  by  a,  to 
a  probability  of  accepting  the  hypothesis  that  a  sample  from 
the  unknown  distribution  is  a  sample  from  our  approximation 
distribution.  While  the  D  and  other  statistics  are  more 
difficult  to  calculate  than  divergence,  the  statistical 
meaning  that  results  may  be  worth  the  effort. 

Summary 

We  have  touched  on  several  aspects  of  sensitivity 
while  concentrating  on  sensitivity  of  the  approximation 
density,  p(x),  to  errors  in  the  expected  value  vector, 

<G>0 .  Two  general  sensitivity  models  were  explored  with 
examples.  One  model  resulted  in  an  upper  and  lower  bound 
on  p(x)  as  a  function  of  an  a  error  bound  on  <G>q.  The 
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Chapter  XI .  Other  Applications 

Introduction 

The  entropy  procedure,  as  presented  in  Chapter  III, 
is  a  flexible  tool  for  characterizing  or  approximating 
unknown  distributions.  Application  of  this  procedure  to 
computer  simulation  has  been  discussed.  However,  the 
generality  and  flexibility  of  the  method  enables  wide  appli 
cation.  Three  examples  are  presented  in  this  chapter  to 
demonstrate  potential  use  and  with  the  intent  of  stimu¬ 
lating  thought  for  other  applications. 

Cumulative  Data  Versus 
Expected  Values 

The  entropy  procedure  provides  an  approximation  of 
an  unknown  distribution  based  on  information  about  that 
distribution.  The  entropy  approximation  is  "minimally 
prejudiced"  in  that  only  the  available  information  is  used 
and  "maximum  uncertainty"  is  maintained  with  respect  to 
other  information.  Our  development  used  information  in 
the  form  of  expected  values  of  certain  information  func¬ 
tions.  However,  the  entropy  procedure  may  be  applied  when 
the  available  information  takes  other  forms.  As  an  example 
the  analyst  may  encounter  distributions  where  the  cumula¬ 
tive  probability  function  is  known  or  can  be  estimated  at 
a  finite  number  of  points  and  expected  values  cannot  be 


207 


estimated.  We  show  that,  using  only  the  cumulative  infor¬ 
mation,  the  resulting  "minimally  prejudiced"  distribution 
is  a  piecewise  uniform  distribution. 

Our  total  information  consists  of  values  of  the 
cumulative  distribution,  C^,  at  n  points,  x^,  i=l,2,...n, 
where  a^x.^^j^.  .  .fxR^b  and  [a,b]  is  the  approximation 
inverval.  Following  the  development  of  Chapter  IV,  we 
state  the  characterization  problem: 

max  S(p(x))  =  max  [-/bp(x)  In  (p(x))  dx] 
subject  to: 

■\b  p  (x)  dx  =  L 

3 


/  X1  p(x)  dx  =/  b  s',  (x)  p(x)  dx  =  C1 

3  3-1.  X 

/.Xnp(x!  dx=.rab  Cn(x)  p(x)  dx  =  CR 

where  C^(x)  =  1,  a.Sx.ix^ 

=  0,  otherwise. 

The  Lagrangian  becomes 

L  (p  (x)  ,  A)  =  /^-p  (x)  lnp(x)  -  A  qP  (x )  -  Z> i  (x)  p  (x)  dx+I^^ 


=  p (x)  [ In ( 1 /p  (x) ) -XQ  -  i  (x) ]  dx  +  1 1 iCi 
=  /_^p(x)  In  {  ( 1  /p  ( x )  )  exp  f - 1  q  -  1  1  i'  i  )  ]  )dx  +Z  5  ^ 


T 

where  /v  =  (Xq,X^...X  )  and  all  summations  are  from  i=l  to  n. 
We  recall  that  ln(v;)<w-l  for  all  w  and  ln(w)=w-l  if  and 
only  if  w=l.  Thus, 

h  n 

L  (p  (x)  ,  A)  </  p  (x)  {  (1/p  (x) )  exp  [  —  X— ZX.  ^  .  (x )  ]  -  1 }  dx  +  I  *  . C 

a  U  1  1  i”*!  ^ 

Since  we  wish  to  maximize  L(p(x),A),  we  seek  equality  which 
occurs  if  and  only  if 

n 

p(x)  =  exp[-An-  Z  0 -  ( x )  ]  almost  everywhere.  (11.1) 
u  i=l  1  1 

Thus,  p(x)  is  a  uniform  distribution  between  each  of  the 
known  x^,  i=l,...n;  that  is, 

k 

p(x)  =  expf-U-I  v  ■  (x)  )  ,  a  <x  <x.  (11.2) 

i=l  11  ~  K 

and  p(x)  is  a  piecewise  uniform  distribution. 

The  Lagrange  multipliers  are  easily  calculated  as 

we  show  with  a  numerical  example  for  n=3.  Table  XI. I  and 

Figure  11.1  present  the  data  and  the  interval  of  action 

for  each  i^(x)  function.  From  the  constraints  we  have 

xxx  b 

/b  p  (x)  dx  =  /  1  p  (x)  dx  +/  2  P  (x)  dx  +f  3  p  (x)  dx  +/  p  (x)  dx 

a  a  X  ^  X  ^  X  ^ 

x  ^  o 

Working  backward  from  point  b  we  find; 
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f  ^  exp [-Xg]  dx  =  1  -  C3  or 

XQ  =  -  ln[  (1-C3)  /  (b-x3)  ]  =  -.47  00  4  , 

f  X3 

x2  exp[-XQ-X3]  dx  =  C3  -  C2  or 

X3  =  ~X0“  ln  Hc3-c2)  /  (x3-x2)  ]  =  .287682, 

r  x2 

x1  exp [-XQ-X2-X3]  dx  =  C2  -  C^,  or 

X2  =  “X0">3_ln  l  (C2-C1)/(x2-x1)]  =  .  405465  , 


and 


/  1  , 
a  pvnf- 


exp  i-Aq  —  A^-A2  —  A3]  dx  =  to  find 


X1  =  “X0-X2-X3-ln  [ C x /  (x1~a) ]  =  .  693147. 


Thus,  from  equation  11.2 


p(x)  =  exp[-. 916290] 
exp [-. 223144] 
exp[. 182322] 
exp [ . 470004] 

0 


]..0£x<1.25, 
1 . 2  5<x£l . 5 , 
1.5<x<1.75, 
1.75<x<2.0, 
otherwise . 


This  simple  example  illustrates  the  concept.  The 
entropy  density  reproduces  the  known  information,  i.e.,  the 
cumulative  values,  but  does  not  bias  our  approximation  in 
any  other  sense.  By  providing  more  information,  such  as 
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expected  values,  we  increase  the  accuracy  of  approxima¬ 
tion  . 


Hierarchical  Models 

Our  entropy  characterization  procedure  may  prove 
quite  useful  in  the  analysis  of  hierarchical  models. 
"Hierarchical  model"  implies  a  group  of  submodels  of  dif¬ 
ferent  degrees  of  detail  (perhaps  computer  simulations) 
where  the  outputs  of  the  more  detailed  submodels  provide 
input  to  "higher  level"  submodels.  As  a  typical  example, 
one  might  envision  a  large  scale  air  war  game  where  the 
first  modeling  echelon  is  divided  into  models  for  the  vari¬ 
ous  theaters  of  operation.  The  second  echelon  supports 
the  first  level  with  models  for  air  engagements,  inter- 
diction,  or  air  defense.  Subsequent  levels  provide  detailed 
models  of  munitions  supply,  aircraft  maintenance,  tanker 
support,  targeting,  etc. 

The  entropy  procedure  provides  a  means  to  evaluate 
a  hierarchical  model  at  the  submodel  level.  In  fact,  the 
procedure  enables  characterization  of  the  output  distribu¬ 
tion  of  a  submodel.  Potential  benefits  include  evaluation 
of  the  "degree  of  influence"  of  a  specific  submodel  and  sub¬ 
model  dependencies.  If  the  models  are  computer  simulations* 
then  the  procedure  may  also  save  substantial  computer  time. 
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Interval  Arithmetic 


As  a  third  example,  we  consider  application  of  the 
entropy  procedure  to  interval  arithmetic.  Moore  discusses 
interval  arithmetic  and  potential  uses  in  his  article 
"Bounding  Sets  in  Function  Spaces  with  Applications  to 
Nonlinear  Operator  Equations"  (Ref  62)  .  As  the  name 
implies,  interval  arithmetic  is  the  application  of  arith¬ 
metic  operations  to  intervals  of  the  real  line.  The  pur¬ 
pose  is  to  specify  a  bound  on  the  result  of  an  operation. 
Interval  operations  are  defined  as  follows: 

addition,  [a,b]  +  [c,d]  =  [a+c,b+d]; 

subtraction,  [a,b]  -  [c,d]  =  [a-d,b-c]; 

multiplication,  [a,b]  *  [c,d]  =  [min (ac ,ad,bc,bd) , 

max (ac ,ad,bc ,bd) ] ; 

and  division,  [a,b]/[c,d]  =  [a,b]  *  [  (1/d) ,  (1/c) ] . 

Moore  mentions  several  areas  for  application  of  interval 
arithmetic  techniques:  search  procedures,  safe  starting 
regions  and  stopping  criteria  for  iterative  schemes  (Ref  61) , 
and  error  bounds  for  machine  computation.  The  basic  assump¬ 
tions  are  that  each  operand  is  a  bounded  interval  (as 
[a ,b] ) ,  and  that  the  value  of  the  operand  can  fall  any¬ 
where  within  the  stated  bound.  Interval  arithmetic  provides 
an  absolute  bound  for  an  operation  but  provides  no  informa¬ 
tion  about  the  distribution  of  the  result.  When  working 
with  the  resulting  interval,  one  must  assume  a  uniform 
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distribution  or,  if  an  explicit  value  is  needed,  assume 
the  middle  or  mean  value. 

Clearly,  the  usefulness  of  interval  arithmetic  is 
enhanced  if  the  distribution  of  the  result  is  known  in 
addition  to  the  absolute  bound.  In  fact,  knowledge  of  the 
resulting  distribution  may  enable  reduction  in  size  of  the 
resulting  interval,  with  a  selected  degree  of  confidence, 
or  selection  of  a  more  accurate  mean  value.  Our  entropy 
procedure  can  provide  the  desired  distribution  approxima¬ 
tion.  We  demonstrate  with  the  model  of  Figure  11.2. 


Fig.  11.2.  Hierarchical  Model  of  A2+BC 

Figure  11.2  provides  a  hierarchical  scheme  for  the 
interval  equation  A2+BC  where  variables  A,  B,  and  C  are 
uniformly  distributed  as  indicated.  (This  example  also 
demonstrates  the  use  of  entropy  to  evaluate  submodels  of  a 
simplified  hierarchical  model.)  Applying  Moore's 


operations  on  the  given  intervals  we  find  the  following: 

A*A  is  V  [-1,1]  ,  B*C  is  ^  [  0 , 3 3  ,  and  AA-rBC  is  ^  [-1,4]  .  We 
can  improve  these  bounds  by  refining  rules  for  the  squaring 
operation,  i.e.,  A*A=A‘  must  be  nonnegative.  Our  improved 
bounds  follow:  A2  is  V  [0,1],  BC  is  £/[0,3],  and  A2+BC  is 
1/  [  0 , 4  ]  .  In  reality,  the  three  variables  (A2,  BC,  and 
A2+BC)  are  not  uniformly  distributed.  Using  transformation 
of  variables  techniques  (Ref  37) ,  the  analytic  distributions 
of  these  variables  may  be  derived.  For  example,  the  density 
of  A 2  given  that  A  is  U[-l,l]  is  f  (A2 )  =ln  ( 1 .  5)/,  TJ ,  0<A2_<1. 
However,  analytic  derivation  becomes  increasingly  difficult 
as  operand  distributions  become  more  complex.  The  entropy 
approach  offers  a  viable  alternative. 

Application  of  the  entropy  procedure  to  the  problem 
of  Figure  11.2  produces  an  interesting  demonstration  of 
procedure  flexibility.  We  first  generate  500  samples  each 
for  A,  B,  and  C  from  independent  uniform  distributions 
and  generate  subsequent  sample  distributions  for  A2,  BC, 
and  (A?+BC).  For  each  output  distribution,  we  then  calcu¬ 
late  average  value  estimates,  i.e.,  <g^ (x) >  =  I  g^ (x ^ ) /500 , 
for  the  expected  values  of  our  potential  information  func¬ 
tions.  Application  of  method  three  (expected  value  method 
of  Chapter  VIII)  produces  the  results  listed  in  Table  XI. II. 
Method  three  provides  an  excellent  approximation  for  A2 , 
but  notice  the  large  errors  in  Table  XI. II  for  BC  and 
(A2  +BC ) .  Method  three  does  not  produce  an  acceptable  fit 
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TABLE  XI. II 


RESULTS  OF  METHOD  3  FOR  INTERVAL  ARITHMETIC 


Variable 

A2 

BC 

(A2+BC) 

where  Squared  Error  = 
functions  are  defined 


Active  Set 


F2 ,F3,F7,F8 
F3 
F5 
9 


I  (g .  (x) -g  .  (x)  ) 
i  =  l 

in  Table  V. Ill . 


Squared  Error 

.091077 
3.25683 
4 .53455 


and  information 


tc  the  average  value  vectors  for  these  two  variables  and, 
thus,  does  not  provide  acceptable  distribution  approxima¬ 
tions.  The  large  errors  indicate  two  possible  problems; 
either  the  potential  set  of  information  functions  is  inade¬ 
quate  (Ref  Chapter  V),  or  our  data,  i.e.,  the  average  value 
vector,  is  too  inaccurate.  Examination  of  the  sample  dis¬ 
tributions  (graphs  are  presented  in  Figures  11.3  through 
11.8)  does  not  indicate  extreme  behavior,  i.e.,  bimodal  or 
peaked  distributions,  thus  our  potential  set  seems  appropri¬ 
ate.  We  investigate  the  expected  value  approximations  by 
analytically  computing  a  few  expected  values  for  the  AJ  and 
BC  distributions.  Table  XI. Ill  presents  a  comparison  which 
highlights  the  error  between  average  and  true  expected 
values.  As  the  table  shows,  we  have  tried  to  approximate 
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TABLE  XI. Ill 


AVERAGE  VALUES  VERSUS  ANALYTIC  EXPECTED  VALUES 


Variable 

Function 

Average  Value 

Analytic  Value 

A2  =x 

<x> 

.339319 

. 333333 

A2  =x 

<  (x-u)2> 

.090752 

.088889 

A2  =x 

<  (X-P)  /c> 

.7  (-12) 

.0 

A2=x 

<ln  (x) > 

-1.99329 

-2.0 

A  2  =x 

<  (x-u)  3/c3> 

. 613545 

. 638877 

A2=x 

< (x-u) 4 /o4  > 

2.10257 

2.14286 

A2  =x 

'  (In  (x-a) ) 2  > 

7.9370 

8.0 

BC=x 

<xs 

1.23544 

l-?5 

BC=x 

<  (x-u)  2  s 

.556546 

.548611 

BC=x 

"  (x-u) /c> 

.1  (-11) 

.0 

BC=x 

<ln  (x) > 

-.129399 

-.090458 

X 

ii 

u 

cq 

<  (x-u)  3 /a  3  > 

.168989 

.128175 

BC=x 

<  (x-u) 4/c4> 

2.02580 

1.97674 

* 

where  u 

=  rr.ean  and  a  =  standard  deviation 
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a  distribution  based  on  rather  inaccurate  information. 
Method  three  performed  properly  for  the  supplied  data. 

The  flexibility  of  our  procedure  is  significant 
at  this  point  in  that  we  have  two  options  for  increasino 
the  accuracy  of  approximation.  First,  we  can  produce 
more  accurate  expected  value  estimates  via  quadrature 
(or  averages  with  a  larger  sample)  and  reapply  method 
three.  Secondly,  we  can  use  the  available  average  values 
but  decrease  the: r  significance  by  concentrating  on  a  fit 
to  the  sample  distributions.  We  choose  the  second  option 


Vr  e  a 


rtive  set  selection  method  two  (diver ounce 


approach  of  Chapter  VII)  which  takes  advantage  of  sample 
distribution  a vailabi 1 ity . 

The  results  of  method  two  are  displayed  in 
Table  XI. IV.  The  accuracy  of  our  approximations  is  shown 
in  Figures  il.3  through  11.8  which  provide  sample  and 
entropy  comparisons  for  densities  and  cumulative  distribu¬ 
tions.  The  cumulative  graphs  include  the  uniform  cumula¬ 
tive  to  hiohlicht  the  error  of  assuming  a  uniform  output, 
distribution.  notice  that  the  entropy  and  sample  cumula- 
tives  are  plotted  over  the  sample  points,  x^,  i=i,...500, 
while  the  uniform  is  plotted  for  the  entire  active  inter¬ 
val  ,  [a ,b] .  Though  not  plotted,  the  entropy  approximation, 
p  '::)  ,  applies  over  the  entire  interval,  and  /  b  p  (x)  dx=l. 
The  entropy  procedure  clearly  provides  acceptable  approx i- 
v  it  ions  *c  the  c-utr-nt  distributions . 
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TABLE  XI. IV 


RESULTS  OF  METHOD  2  FOR  INTERVAL  ARITHMETIC 


Variable 

Active  Set 

Divergence 

A 

F5 

J(P5 

(x) ,f (x) >=.023823 

BC 

F7,F9 

J(P79 

(x) ,f (x) >=.000116 

(A2+BC) 

F3,F4,F9 

J (p349 

(x) ,f (x) >=.001512 

where  f (x) 

is  the  sample  density 

and  information  functions 

are  defined 

in  Table  V.III. 

Summary 

We  have  touched  on  three  potential  applications 
of  the  entropy  procedure  in  addition  to  computer  simula¬ 
tion.  The  procedure  uses  available  or  computable  informa¬ 
tion,  and  the  accuracy  of  approximation,  of  course,  depends 
on  the  amount  and  accuracy  of  the  information.  Thus,  the 
analyst  must  weigh  information  collection  costs  against 
the  benefit  of  accurate  density  approximations.  The  pos¬ 
sible  applications  for  such  a  procedure  are  numerous.  Our 
examples  only  represent  a  starting  point. 
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Chapter  XII .  Summary  and  Future  Research 


Summary 

We  have  used  the  concept  of  "maximum  entropy"  to 
develop  a  procedure  for  characterizing  (or  approximating) 
an  unknown  distribution  based  on  information  about  that 
distribution.  The  procedure  uses  available  information 
but  maintains  "maximum  uncertainty"  with  respect  to  unspe¬ 
cified  information  and  provides  a  "minimally  prejudiced" 
representation  of  the  unknown  density.  Our  development 
requires  information  in  the  form  of  expected  values  of 
"information  functions,"  but  the  procedure  can  be  applied 
to  other  forms  of  information.  The  work  is  based  on  a  con¬ 
strained  optimization  problem  and  includes  three  procedural 
steps:  specification  of  a  potential  set  of  information 
functions,  selection  of  the  active  set  ior  a  particular 
approximation,  and  solution  of  the  constraint  equations 
to  completely  define  the  approximation  density. 

We  have  shown  that  if  a  solution  exists  to  the 
optimization  problem,  then  it  is  unique.  Further,  the 
solution  density  will  take  the  following  form: 

p  (x)  =  exp  [— XQ— ^i  gi  (x)  .  ,-Ak  gk  (x)  ) 

where  the  g^(x)  are  information  functions  and  the  A^  are 
associated  Lagrange  multipliers.  A  numerical  scheme  for 
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solution  of  the  constraints  was  presented.  The  numerical 
scheme  may  converge  to  a  local  optimum  solution  versus  the 
global  solution.  However,  under  extensive  testing  against 
known  distributions,  the  scheme  has  always  produced  the 
correct  result. 

Three  separate  methods  were  presented  for  selection 
of  the  active  set  of  information  functions.  Methods  one 
and  two  require  expected  value  estimates  and  a  random 
sample  of  the  unknown  distribution.  Method  three  requires 
only  expected  values.  Selection  of  a  specific  method  is 
problem  and  data  dependent.  In  experimentation  with  known 
analytic  distributions,  the  methods  either  characterized 
the  analytic  or  provided  a  compromise  between  sample  and 
analytic  distributions.  Accuracy  of  approximation  with 
all  methods  is  a  function  of  potential  set  specification 
and  data  accuracy. 

Sensitivity  aspects  were  addressed  to  include  two 
approaches  to  a  study  of  system  sensitivity.  Finally, 
several  examples  and  example  applications  of  the  entropy 
approximation  procedure  were  presented.  The  entire  approxi¬ 
mation  procedure,  to  include  the  three  information  func¬ 
tion  selection  methods  and  sensitivity  studies,  has  been 
programmed  for  computer  use. 
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Future  Research 


Our  research  surfaced  several  areas  for  continued 
investigation.  These  areas  are  highlighted  in  the  follow¬ 
ing  paragraphs. 

The  entropy  procedure  was  applied  to  interval  [a,b], 
assuming  that  the  unknown  density  was  "relatively"  well 
behaved.  In  at  least  two  examples,  the  bimodal  distribu¬ 
tion  of  Figure  8.1  and  the  interval  arithmetic  examples  for 
BC  and  (A;+BC)  of  Figures  11 .  5  through  11.8,  we  approxi¬ 
mated  distributions  that  were  not  entirely  well  behaved. 

For  these  examples  we  briefly  investigated  a  piecewise 
application  of  the  entropy  procedure,  i.e.,  a  division  of 
interval  [a,b]  into  subintervals  [a,b]=[a,c^j  [c^,c2]... 

[cn,b]  with  application  of  the  entropy  procedure  to  each 
subinterval.  This  concept  holds  potential  for  more  diffi¬ 
cult  distributions. 

Expansion  of  our  work  to  distributions  on  the  semi¬ 
infinite  and  infinite  intervals  is  feasible.  Such  an  expan¬ 
sion  centers  on  an  investigation  of  numerical  quadrature 
schemes  and  numerical  procedures  for  solving  the  constraint 
equations.  Orr  (Ref  63)  has  accomplished  some  preliminary 
work  in  this  area. 

The  research  centered  on  characterizing  univariate 
distributions,  yet  the  theoretical  development  supports 
multivariate  characterization.  Such  development  may  follow 
from  the  work  presented  in  this  paper. 
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We  have  discussed  three  methods  for  selection  of 
the  active  set  of  information  functions.  Two  of  the  methods 
involve  a  fit  to  the  sample  density  where  the  sample  density 
is  produced  by  numerical  differentiation.  The  methods  are 
successful  because  they  partially  compensate  for  the  numeri¬ 
cal  differentiation  error.  The  development  of  a  scheme 
which  uses  the  sample  cumulative,  thus  avoiding  one  level 
of  numerical  error,  may  prove  beneficial.  Such  a  scheme 
could  follow  the  structure  of  method  two,  given  a  means  to 
"efficiently"  compute  errors  between  cumulatives. 

Use  of  the  entropy  procedure  fcr  hypothesis  testing 
is  a  viable  research  area.  Consider  method  three  which  pro¬ 
duces  the  entropy  density  by  forcing  an  approximation  to  the 
expected  values  of  the  potential  information  functions 
(Chapter  VIII).  As  shown  in  Table  VIII. II,  when  the  expected 
values  are  accurate  and  the  potential  set  includes  the 
correct  analytic  functions,  method  three  will  accurately 
characterize  the  unknown  density.  For  example,  if  the 
unknown  density  f  (x)  is  normal  and  the  potential  set 
includes  functions  x  and  x2 ,  then  these  functions  are 
selected  for  the  entropy  approximation,  p(x),  such  that 
p(x)=f(x).  Such  results  suggest  the  use  of  this  procedure 
to  test  if  the  unknown  distribution  is  normal,  or  beta, 
etc.  Again,  the  key  factor  in  success  of  such  an  approach 
is  accurate  estimation  of  expected  values.  The  suggested 


research  ties  to  recent  work  by  Dudewicz  and  van  der  Meulen 
{Ref  24) . 

Finally,  our  procedure  provides  an  effective  means 
of  approximating  unknown  distributions,  and  we  have  sug¬ 
gested  several  applications.  Potential  applications  are 
numerous  and  a  viable  research  area.  Applications  to  risk 
analysis,  game  theory,  and  pattern  recognition,  akin  to  the 
discrete  entropy  applications,  are  examples  for  continued 
investigation . 
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Appendix  A.  Numerical  Quadrature 

The  primary  purpose  of  numerical  integration  (also 
called  quadrature)  is  evaluation  of  integrals  which 
are  either  impossible  or  else  very  difficult  to  evalu¬ 
ate  analytically  [Ref  40:144). 

Quadrature  also  offers  an  effective  means  of  machine  inte¬ 
gration,  and  a  variety  of  numerical  integration  methods  are 
available  (Ref  1) .  One  such  method  which  is  particularly 
adaptable  to  machine  computation  is  Gauss  quadrature.  We 
consider  the  general  quadrature  approach  and  then  Gauss 
quadrature  specifically. 

Given  the  function  f (x)  and  the  values  of  f (x)  at 
N  points,  x^ ,  i=l,2,...N,  we  wish  to  calculate  the  inte¬ 
gral  J\b  f  (x)  dx.  The  general  quadrature  rule  to  approximate 

a 

this  integral  follows: 

b  N 

/  f  (x)  dx  ~  I  W.  f  (x.)  (A.  1) 

a  i=l  1  1 

where  the  weights,  ,  are  determined  by  requiring  that 

equation  A.l  be  exactly  true  when  f (x)  is  replaced  by 
2  N-l 

1,  x,  x  ,...x  .  Thus  we  have  N  equations  of  the  form 

of  equation  A.l  with  the  N  unknown  W^,  i=l,2,...N.  Select¬ 
ing  the  weights  by  solving  these  N  equations  will  guarantee 
that  the  guadrature  rule  will  exactly  integrate  any  poly¬ 
nomial  of  degree  N-l  or  less. 
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Gauss  quadrature  improves  the  accuracy  of  the 
integral  by  using  orthogonal  polynomials  and  selecting  the 


points  x^,  i=l,2,...N,  to  be  zeroes  of  the  orthogonal  poly¬ 
nomials.  Gauss  quadrature  thus  assumes  that  one  can  obtain 
the  values  of  f (x)  at  the  unevenly  spaced  quadrature  points 
x^,  i=l,2,...N.  Abramowitz  and  Stegun  (Ref  1)  and  Hornbeck 
(Ref  40)  provide  detailed  explanation  and  examples  of 
Gauss  forms  for  various  sets  of  orthogonal  polynomials. 
Abramowitz  and  Stegun  provide  tables  for  weights  and  quadra¬ 
ture  points.  From  this  reference  we  find  the  following 
formula  for  Legendre  polynomials: 

i  N 

I  ‘  f  (x)  dx  =  I  W .  f  ( x  .  )  +  R., 

-1  i=1  11  IN 

or 

/abf(x)  a*  =  W  *  "n 

where 

»i  -  (¥)  >> *  (¥) 

and 

(b-a)2N+1 (N!) 4  22N+lf2N(r) 

( 2N+1)  (  (2-N)  ! 3  • 

The  Gauss-Leqendre  formula  will  produce  exact  results  for  a 
polynomial  of  degree  2N-1  or  less.  Thus  if  f(x)  is  closely 
approximated  by  a  polynomial  of  degree  2N-1,  then  the  error 
of  approximation,  R^ ,  will  be  small.  The  integrals  that 


are  involved  in  the  characterization  method  of  this  paper 
generally  concern  continuous,  well-behaved  functions. 
Quadrature  is  thus  an  effective  and  accurate  tool  for  our 
application.  Hornbeck  (Ref  40)  discusses  practical  methods 
of  testing  quadrature  accuracy  and  potential  quadrature  pit- 
falls. 

The  quadrature  formulae  discussed  above  are  easily 
extended  to  multiple  integrals: 


r  b 

'al 


1 

a2 


f ( x , y )  dx  dy 


b2~a2j 


b 


2 


-a 


2 


N 

I  W.  f (x,y  . )  dx 
i=l  1  1 


r  W  .  f  (  X  ,y  . ) 

11  J  1 

i=l 


The  accuracy  of  approximation  is  now  reduced  and  more  func¬ 
tion  evaluations  are  necessary.  Specifically,  we  required 
N  function  evaluations  for  one-dimensional  quadrature. 
Two-dimensional  quadrature  requires  Nz  function  evaluations 

or  quadrature  points.  In  a  similar  fashion,  K-dimensional 

K 

quadrature  requires  N  quadrature  points  or  functional 
evaluations.  The  decreased  accuracy  and  increased  number 
of  functional  evaluations  are  discussed  in  detail  by  Haber 
(Ref  34) .  Haber  suggests  that  multidimensional  quadrature 
is  effective  up  to  dimension  three  or  four.  He  advises  the 
use  of  other  methods,  such  as  Monte  Carlo  quadrature,  for 


241 


integrals  of  dimension  five  or  greater.  Multidimensional 
quadrature  is  of  particular  use  in  application  of  the 
entropy  characterization  procedure  to  computer  simulation; 
see  Chapter  IX. 


Appendix  B.  Goodness  of  Fit  Statistics 

Statistical  Tests 

Selection  of  the  active  set  of  information  func¬ 
tions  in  the  regression  method.  Chapter  VI,  involves  the 
use  of  "goodness  of  fit"  statistics.  That  is,  a  random 
sample  of  the  unknown  distribution  is  available,  and  we 
wish  to  test  the  hypothesis  that  the  given  sample  is  from 
one  of  the  entropy  distributions.  We  choose,  as  active, 
the  entropy  distribution  that  provides  the  highest  level 
of  confidence  in  the  truth  of  our  hypothesis,  i.e.,  the  dis¬ 
tribution  with  the  smallest  value  for  the  selected  statis¬ 
tic.  This  appendix  discusses  a  few  popular  statistics  for 
hypothesis  testing  to  include  Chi-squared  (x2)  and  Empirical 
Distribution  Function  (EDF)  statistics. 

Several  references  (Refs  6;  21;  25;  54;  55;  64) 
define  goodness  of  fit  statistics  which  are  appropriate  for 
testing  the  hypothesis,  HQ,  that  a  given  sample  is  from  a 
specified  distribution.  M.  A.  Stephens  (Refs  75;  76)  pre¬ 
sents  an  excellent  summary  of  the  more  well  known  statis¬ 
tics  to  include  advantages  and  disadvantages.  As  Stephens 
mentions,  the  classical  test  for  goodness  of  fit  is  the  x2 
test. 

The  x2  test  can  be  applied  in  our  case,  given  that 
the  entropy  densities  are  completely  defined  prior  to  the 
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test  and  defined  without  recourse  to  the  sample  data,  i.e., 
solution  for  the  A  vector  in  each  p(x)  may  not  depend  on 
the  test  sample  x(l),  1=1,... N.  This  restriction  is  neces¬ 
sary  because  the  x2  test,  when  estimated  parameters  are 
involved,  requires  a  maximum  likelihood  parameter  estima¬ 
tion  to  insure  a  x2  distribution  for  the  test  statistic 
(Ref  85).  The  requirement  for  complete  specification  of 
the  entropy  densities  prior  to  goodness  of  fit  testing  will 
also  apply  to  the  EDF  statistics.  Solution  for  the  entropy 
density  parameters,  the  A  vector,  depends  on  the  expected 
value  vector  <G>  (Chapters  IV  and  VI) .  We  produce  <G>  by 
numerical  quadrature  in  our  tests  of  method  one  and  do  not 
rely  on  the  random  sample.  Thus  the  entropy  densities  are 
specified  before  goodness  of  fit  testing  and  without  using 
the  sample. 

The  analyst  should  notice  that  the  above  restric¬ 
tion  on  parameter  estimation  does  not  preclude  the  use  of 

N 

average  value  estimates  of  <G>,  i.e.,  <g.(x)>  =  Z  g.(x.)/N. 

3  i=l  3  1 

Method  one  may  still  be  applied  with  average  values  and  will, 
as  demonstrated  in  Chapter  VI,  provide  an  excellent  approxi¬ 
mation  to  the  unknown  density.  However,  if  the  sample  is 
used  to  generate  <G>  and  thus  to  find  p(x),  then  the  analyst 
can  not  place  the  usual  statistical  significance  on  the 
values  of  calculated  statistics.  The  statistics  will  still 
offer  a  measure  for  selecting  the  best  entropy  distribution, 
but  we  may  not  use  the  statistic  in  conjunction  with 


existing  statistical  tables  to  state  a  confidence  in  our 
test  of  hypothesis  Hq.  A  true  statistical  test  of  hypo¬ 
thesis,  under  these  circumstances,  will  require  a  second 
independent  sample  of  the  unknown  distribution.  The  refer¬ 
ences  provide  more  detail  on  this  restriction  and  effective 
use  of  statistics. 

Stephens  (Ref  75)  states  that  when  the  hypothesized 
distribution  (the  entropy  distribution)  is  completely  spe¬ 
cified  and  continuous  then,  ",  .  .  in  general,  EDF  statis¬ 
tics  give  more  powerful  tests  of  HQ  than  X2."  The  EDF 
statistics  of  interest  are  summarized  below.  Statistic 
selection  is  at  the  user's  discretion.  Method  one  can  be 
used  with  v2  and  any  of  the  EDF  statistics,  or  other  suit¬ 
able  statistical  tests.  The  user  must  determine  which 
aspect  of  the  approximation  is  of  greatest  importance  to 
him.  For  example,  the  Anderson-Darling  statistic  emphasizes 
a  fit  to  the  tails  of  the  unknowns  distribution  while  the 
Kolmogorov-Smirnov  statistic  measures  maximum  error  in  the 
approximation.  Several  statistics  and  their  benefits  are 
now  considered. 

EDF  Statistics 

x  ( I ) 

Let  EN  ( I )  =  f  p(x)  dx  where  p(x)  is  the  entropy 
density  for  a  given  information  function  set,  and  x(I), 
1=1,... N  is  the  sorted  random  sample  from  the  unknown 


distribution.  The  sample  cumulative  at  each  of  the  I  points 


is  CUM (I)  =  I/N. 

Kolmogorov  Statistics. 

D+  =  max  [  (I/N)  -  EN  (I)  ]  ; 

I 

D~  =  max  [EN (I)  -  (I-1)/NJ;  and 
I 

D  =  max  [D+,D  ] . 

The  statistic  of  interest  is  D  (usually  called  Kolmogorov- 
Smirnov  statistic)  which  tests  the  maximum  deviation  of  the 
sample  cumulative  from  the  entropy  cumulative. 

2 

Cramer-von  Mises  Statistic ,  W  . 

i 

K  '  =  n  /  b  [F  (x)-F(x))2  G  [F  (x)  ]  dF(x),  where  F(x) 
n  an 

is  the  hypothesis  distribution,  Fn(x)  is  the  sample,  and 
G[F(x)]  is  a  weight  function.  We  use  the  Smirnov  weight 
function,  G~l,  and  integrate  to  obtain  a  computational  form 
of  the  statistic: 

N 

W?  =  I  ( EN  ( I )  -  (21-1) /2NT  +  [1/12N] 

1=1 

This  form  of  the  Cramer-von  Mises  statistic  is  akin  to  the 
sum  of  errors  squared  and  weights  each  data  point  evenly. 
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Anderson-Darling  Statistic ,  A2 .  If  we  use  the 
Cramer-von  Mises  W^2  statistic  and  define  G[F(x)]  to  be 
l/{F(x) [l-F(x)]},  then  the  result  is  the  Anderson-Darling 
statistic.  References  6,  54  and  75  reduce  A2  to  a  computa¬ 
tional  form: 

N 

A2  =  -{  I  (21-1)  [ln(EN (I) )  +  In  (1-EN (N+l-I) ) ] ) /N  -  N 
1  =  1 

This  statistic  emphasizes  a  fit  to  the  tails  of  the  dis¬ 
tribution  . 

Kuiper  Statistic ,  V . 

V  =  D+  +  D~ 

Vjatson  Statistic ,  U2  . 

U2  =  W2  -  N (<EN>  -  1/2) 2 
N 

where  <EN>  =  Z  EN(I)/N.  U2  adjusts  for  the  hypothesized 
1  =  1 

mean.  Stephens  states  that  both  V  and  U2  are  useful  in 
identifying  a  change  in  scale  (variance)  of  the  sample  while 
D,  W2 ,  and  A*  are  more  effective  for  a  change  in  location 
(mean) .  The  references  discuss  the  above  statistics  and 
other  variations  of  the  above.  Stephens  presents  a  compari¬ 
son  that,  for  his  purposes,  favors  the  A2,  W2  and  U2  sta¬ 
tistics. 
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Statistics  for  Method  One 

To  select  the  "best"  information  function  set  from 
the  candidate  regression  sets  of  Chapter  VI,  we  prefer  the 
A2  or  W‘  EOF  statistics.  Again,  the  "best"  set  will  be  the 
set  of  functions  that  results  in  the  smallest  value  of  A 2 
(or  K‘ )  .  EDF  statistics  are  preferred  to  >;2  primarily 
because  the  EDF  statistics  are  distribution-free.  Addi¬ 
tionally,  the  x?  test  requires  an  "unbiased"  grouping  of 
the  data  which  detracts  from  a  generalized  approach. 
Finally,  our  entropy  functions  satisfy  the  "continuity" 
and  "completely  defined”  requirements  of  the  EDF  tests. 
Under  these  conditions,  Stephens  (Ref  75)  states  that  the 
EDF  tests  should  prove  more  powerful  than  \2 . 
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This  report  presents  the  theoretical  development  and  numerical 
implementation  of  a  procedure  for  approximating  continuous  proba¬ 
bility  density  functions  on  a  bounded  interval.  The  work  is 
applicable  to  Bayesian  decision  models  in  that  available  informatior 
is  used  to  update  or  obtain  the  prior  distribution.  The  procedure 
is  based  on  the  solution  of  a  constrained  entropy  maximization 
problem  and  requires  information  in  the  form  of  expected  values  of 
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£  -information  functions.  The  approach  involves  three  steps: 

estimation  of  expected  (or  average)  values  of  -potential— informa¬ 
tion  functions,  selection  of  the  active-""  subset  of  functions  to 
define  the  approximation  family,  and  simultaneous  solution  of  the 
constraints  to  select  the  specific  approximating  density  for  a 
given  set  of  data. 

A  useful  set  of  potential  information  functions  is  developed, 
and  three  numerical  methods  for  active  set  selection  are  demon¬ 
strated.  Numerical  techniques  for  expected  value  computation 
are  discussed,  and  a  scheme  for  solution  of  the  constraints  is 
developed  and  implemented.  Theoretical  development  includes 
theorems  on  form  and  uniqueness.  Approximation  accuracy  is 
related  to  potential  set  definition  and  data  accuracy.  The  pro¬ 
cedure  is  applied  to  several  known  distributions  to  demonstrate 
applicability.  Applications  to  computer  simulation  and  interval 
arithmetic  models  are  demonstrated  with  specific  examples. 
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